The paper considers linear regression problems where the number 
of predictor variables is possibly larger than the sample size. The basic 
motivation of the study is to combine the points of view of model 
selection and functional regression by using a factor approach: it is 
assumed that the predictor vector can be decomposed into a sum of 
two uncorrelated random components reflecting common factors and 
specific variabilities of the explanatory variables. It is shown that the 
traditional assumption of a sparse vector of parameters is restrictive 
in this context. Common factors may possess a significant influence on 
the response variable which cannot be captured by the specific effects 
of a small number of individual variables. We therefore propose to 
include principal components as additional explanatory variables in 
an augmented regression model. We give finite sample inequalities for 
estimates of these components. It is then shown that model selection 
procedures can be used to estimate the parameters of the augmented 
model, and we derive theoretical properties of the estimators. Finite 
sample performance is illustrated by a simulation study. 

1. Introduction. The starting point of our analysis is a high-dimensional 
linear regression model of the form 

(1.1)  $Y_i = \beta^T X_i + \varepsilon_i, \quad i = 1, \ldots, n,$

where $(Y_i, X_i)$, $i = 1, \ldots, n$, are i.i.d. random pairs with $Y_i \in \mathbb{R}$ and
$X_i = (X_{i1}, \ldots, X_{ip})^T \in \mathbb{R}^p$. We will assume without loss of generality
that $E(X_{ij}) = 0$ for all $j = 1, \ldots, p$. Furthermore, $\beta$ is a vector of parameters
in $\mathbb{R}^p$ and $(\varepsilon_i)_{i=1,\ldots,n}$ are centered i.i.d. real random variables
independent of $X_i$, with $\mathrm{Var}(\varepsilon_i) = \sigma^2$. The dimension $p$ of the
vector of parameters is assumed to be typically larger than the sample size $n$.

Roughly speaking, model (1.1) comprises two main situations which have
been considered independently in two separate branches of statistical
literature. On one side, there is the situation where $X_i$ represents a
(high-dimensional) vector of different predictor variables. Another situation
arises when the regressors are $p$ discretizations (e.g., at different
observation times) of the same curve. In this case model (1.1) represents a
discrete version of an underlying continuous functional linear model. In the
two setups, very different strategies for estimating $\beta$ have been adopted,
and underlying structural assumptions seem to be largely incompatible. In this paper 
we will study similarities and differences of these methodologies, and we 
will show that a combination of ideas developed in the two settings leads to 
new estimation procedures which may be useful in a number of important 
applications. 

The first situation is studied in a large literature on model selection in 
high-dimensional regression. The basic structural assumptions can be de- 
scribed as follows: 

• There is only a relatively small number of predictor variables with
$|\beta_j| > 0$ which have a significant influence on the outcome $Y$. In other
words, the set of nonzero coefficients is sparse, $S := \#\{j \mid \beta_j \neq 0\} \ll p$.

• The correlations between different explanatory variables $X_{ij}$ and $X_{il}$,
$j \neq l$, are "sufficiently" weak.

The most popular procedures to identify and estimate nonzero coefficients
$\beta_j$ are the Lasso and the Dantzig selector. Some important references are
Tibshirani (1996), Meinshausen and Bühlmann (2006), Zhao and Yu (2006),
van de Geer (2008), Bickel, Ritov and Tsybakov (2009), Candes and Tao
(2007) and Koltchinskii (2009). Much work in this domain is based on the
assumption that the columns $(X_{1j}, \ldots, X_{nj})^T$, $j = 1, \ldots, p$, of the design
matrix are almost orthogonal. For example, Candes and Tao (2007) require that
"every set of columns with cardinality less than S approximately behaves
like an orthonormal system." More general conditions have been introduced
by Bickel, Ritov and Tsybakov (2009) or Zhou, van de Geer and Bühlmann
(2009). The theoretical framework developed in these papers also allows one
to study model selection for regressors with a substantial amount of
correlation, and it provides a basis for the approach presented in our paper.

In sharp contrast, the setup considered in the literature on functional 
regression rests upon a very different type of structural assumptions. We 
will consider the simplest case that $X_{ij} = X_i(t_j)$ for random functions
$X_i \in L^2([0, 1])$ observed at an equidistant grid $t_j = j/p$. Structural assumptions
on coefficients and correlations between variables can then be summarized as
follows:

• $\beta_j := \beta(t_j)/p$, where $\beta(t) \in L^2([0, 1])$ is a continuous slope function, and as
$p \to \infty$, $\sum_{j=1}^p \beta_j X_{ij} = \sum_{j=1}^p \frac{\beta(t_j)}{p} X_i(t_j) \to \int_0^1 \beta(t) X_i(t)\, dt$.

• There are very high correlations between explanatory variables
$X_{ij} = X_i(t_j)$ and $X_{il} = X_i(t_l)$, $j \neq l$. As $p \to \infty$,
$\mathrm{corr}(X_i(t_j), X_i(t_{j+m})) \to 1$ for any fixed $m$.
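
The first of these two assumptions can be checked numerically. The following Python sketch is only an illustration: the slope function $\beta(t) = t$ and the curve $X(t) = t^2$ are hypothetical choices made here, not taken from the paper. It evaluates $\sum_j \beta_j X(t_j)$ with $\beta_j = \beta(t_j)/p$ on the grid $t_j = j/p$ and compares it with the integral $\int_0^1 t \cdot t^2\,dt = 1/4$; it also prints the largest coefficient, which is of size $1/p$.

```python
def discretized_effect(beta, curve, p):
    """Return sum_j beta_j * X(t_j) with beta_j = beta(t_j)/p on the grid t_j = j/p."""
    total = 0.0
    for j in range(1, p + 1):
        t_j = j / p
        total += (beta(t_j) / p) * curve(t_j)
    return total


def slope(t):
    # Hypothetical continuous slope function beta(t); an assumption of this sketch.
    return t


def curve(t):
    # Hypothetical predictor curve X_i(t); an assumption of this sketch.
    return t * t


EXACT_INTEGRAL = 0.25  # integral of t * t^2 over [0, 1]

for p in (10, 100, 1000):
    approx = discretized_effect(slope, curve, p)
    largest_coef = max(slope(j / p) / p for j in range(1, p + 1))
    print(p, approx, abs(approx - EXACT_INTEGRAL), largest_coef)
```

In this toy case the approximation error shrinks as $p$ grows while every individual coefficient becomes smaller, which is the opposite of a sparse coefficient vector.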

Some important applications as well as theoretical results on functional
linear regression are, for example, presented in Ramsay and Dalzell (1991),
Cardot, Ferraty and Sarda (1999), Cuevas, Febrero and Fraiman (2002), Yao,
Müller and Wang (2005), Cai and Hall (2006), Hall and Horowitz (2007),
Cardot, Mas and Sarda (2007) and Crambes, Kneip and Sarda (2009).
Obviously, in this setup no variable $X_{ij} = X_i(t_j)$ corresponding to a specific
observation at grid point $t_j$ will possess a particularly high influence on $Y_i$,
and there will exist a large number of small, but nonzero coefficients $\beta_j$ of
size proportional to $1/p$. One may argue that dimensionality reduction and
therefore some underlying concept of "sparseness" is always necessary when
dealing with high-dimensional problems. However, in functional regression
sparseness is usually not assumed with respect to the coefficients $\beta_j$, but the
model is rewritten using a "sparse" expansion of the predictor functions $X_i$.

The basic idea relies on the so-called Karhunen-Loève decomposition
which provides a decomposition of random functions in terms of functional
principal components of the covariance operator of $X_i$. In the discretized
case analyzed in this paper this amounts to considering an approximation of
$X_i$ by the principal components of the covariance matrix $\Sigma = E(X_i X_i^T)$.
In practice, often a small number $k$ of principal components will suffice to
achieve a small $L^2$-error. An important point is now that even if $p > n$ the
eigenvectors corresponding to the leading eigenvalues $\mu_1, \ldots, \mu_k$ of $\Sigma$ can be
well estimated by the eigenvectors (estimated principal components) $\hat\psi_r$ of
the empirical covariance matrix $\hat\Sigma$. This is due to the fact that if the
predictors $X_{ij}$ represent discretized values of a continuous functional variable,
then for sufficiently small $k$ the eigenvalues $\mu_1, \ldots, \mu_k$ will necessarily be of
an order larger than $p/\sqrt{n}$ and will thus exceed the magnitude of purely
random components. From a more general point of view the underlying theory
will be explained in detail in Section 4.

Based on this insight, the most frequently used approach in functional
regression is to approximate $X_i \approx \sum_{r=1}^k (\hat\psi_r^T X_i)\hat\psi_r$ in terms of the first $k$
estimated principal components $\hat\psi_1, \ldots, \hat\psi_k$, and to rely on the approximate
model $Y_i \approx \sum_{r=1}^k \alpha_r \hat\psi_r^T X_i + \varepsilon_i$. Here, $k$ serves as smoothing parameter. The
new coefficients $\alpha_r$ are estimated by least squares, and $\hat\beta_j = \sum_{r=1}^k \hat\alpha_r \hat\psi_{rj}$.
Resulting rates of convergence are given in Hall and Horowitz (2007).
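
A minimal sketch of this principal component regression for $k = 1$ is given below, in plain Python. The toy curves $X_i(t) = s_i \sin(\pi t)$, the slope function $\beta(t) = t$, and the helper names `leading_eigenvector` and `pc_regression_k1` are assumptions made for illustration only. The leading eigenvector is obtained by power iteration, which is a stand-in for a full eigen-decomposition and is not a method prescribed by the paper.

```python
import math


def leading_eigenvector(matrix, iterations=200):
    """Power iteration for the leading eigenvector of a symmetric PSD matrix."""
    size = len(matrix)
    vec = [1.0 / math.sqrt(size)] * size
    for _ in range(iterations):
        new = [sum(row[j] * vec[j] for j in range(size)) for row in matrix]
        norm = math.sqrt(sum(v * v for v in new))
        vec = [v / norm for v in new]
    return vec


def pc_regression_k1(X, Y):
    """Principal component regression with k = 1; returns (alpha_hat, beta_hat)."""
    n, p = len(X), len(X[0])
    # Empirical covariance matrix (1/n) * sum_i X_i X_i^T (data assumed centered).
    cov = [[sum(X[i][j] * X[i][l] for i in range(n)) / n for l in range(p)]
           for j in range(p)]
    psi = leading_eigenvector(cov)
    scores = [sum(psi[j] * X[i][j] for j in range(p)) for i in range(n)]
    # Least squares of Y on the single score, without intercept.
    alpha = sum(s * y for s, y in zip(scores, Y)) / sum(s * s for s in scores)
    beta_hat = [alpha * psi[j] for j in range(p)]  # beta_hat_j = alpha_hat * psi_hat_j
    return alpha, beta_hat


# Hypothetical noise-free toy data: curves s_i * sin(pi * t) on the grid t_j = j/p.
p = 8
grid = [j / p for j in range(1, p + 1)]
shape = [math.sin(math.pi * t) for t in grid]
amplitudes = [-2.0, -1.0, 1.0, 2.0]
X = [[s * v for v in shape] for s in amplitudes]
beta_true = [t / p for t in grid]  # beta_j = beta(t_j)/p with beta(t) = t
Y = [sum(b * x for b, x in zip(beta_true, row)) for row in X]

alpha_hat, beta_hat = pc_regression_k1(X, Y)
fitted = [sum(b * x for b, x in zip(beta_hat, row)) for row in X]
print(alpha_hat)
print(max(abs(f - y) for f, y in zip(fitted, Y)))
```

Because the toy curves span a one-dimensional space, the fitted values reproduce $Y_i$ exactly, although the estimated coefficient vector is only the projection of the true one onto the first principal component.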

The above arguments show that a suitable regression analysis will have to
take into account the underlying structure of the explanatory variables $X_{ij}$.
The basic motivation of this paper now is to combine the points of view of the
above branches of literature in order to develop a new approach for model
adjustment and variable selection in the practically important situation of
strongly correlated regressors. More precisely, we will concentrate on factor
models by assuming that the $X_i \in \mathbb{R}^p$ can be decomposed in the form

(1.2)  $X_i = W_i + Z_i, \quad i = 1, \ldots, n,$

where $W_i$ and $Z_i$ are two uncorrelated random vectors in $\mathbb{R}^p$. The random
vector $W_i$ is intended to describe high correlations of the $X_{ij}$ while the
components $Z_{ij}$, $j = 1, \ldots, p$, of $Z_i$ are uncorrelated. This implies that the
covariance matrix $\Sigma$ of $X_i$ adopts the decomposition

(1.3)  $\Sigma = \Gamma + \Psi,$

where $\Gamma = E(W_i W_i^T)$, while $\Psi$ is a diagonal matrix with diagonal entries
$\mathrm{var}(Z_{ij})$, $j = 1, \ldots, p$.

Note that factor models can be found in any textbook on multivariate
analysis and must be seen as one of the major tools in order to analyze
samples of high-dimensional vectors. Also recall that a standard factor model
is additionally based on the assumption that a finite number $k$ of factors
suffices to approximate $W_i$ precisely. This means that the matrix $\Gamma$ only
possesses $k$ nonzero eigenvalues. In the following we will more generally
assume that a small number of eigenvectors of $\Gamma$ suffices to approximate $W_i$
with high accuracy.

We want to emphasize that the typical structural assumptions to be found
in the literature on high-dimensional regression are special cases of (1.2). If
$W_i = 0$ and thus $X_i = Z_i$, we are in the situation of uncorrelated regressors
which has been widely studied in the context of model selection. On the
other hand, $Z_i = 0$ and thus $X_i = W_i$ reflect the structural assumption of
functional regression.

In this paper we assume that $W_{ij}$ as well as $Z_{ij}$ represent nonnegligible
parts of the variance of $X_{ij}$. We believe that this approach may well describe
the situation encountered in many relevant applications. Although standard
factor models are usually considered in the case $p \ll n$, (1.2) for large values
of $p$ may be of particular interest in time series or spatial analysis. Indeed,
factor models for large $p$ with a finite number $k$ of nonzero eigenvalues of $\Gamma$
play an important role in the econometric study of multiple time series and 
panel data. Some references are Forni and Lippi (1997), Forni et al. (2000), 
Stock and Watson (2002), Bernanke and Boivin (2003) and Bai (2003, 2009). 

Our objective now is to study linear regression (1.1) with respect to ex- 
planatory variables which adopt decomposition (1.2). Each single variable 
Xij, j = 1, . . . ,p, then possesses a specific variability induced by Zij and may 
thus explain some part of the outcome Yi. One will, of course, assume that 
only few variables have a significant influence on Yi which enforces the use 
of model selection procedures.

On the other hand, the term $W_{ij}$ represents a common variability.
Corresponding principal components quantify a simultaneous variation of many
individual regressors. As a consequence, such principal components may pos- 
sess some additional power for predicting Yi which may go beyond the effects 
of individual variables. A rigorous discussion will be given in Section 3. We 
want to note that the concept of "latent variables," embracing the common 
influence of a large group of individual variables, plays a prominent role in 
applied, parametric multivariate analysis. 

These arguments motivate the main results of this paper. We propose to 
use an "augmented" regression model which includes principal components 
as additional explanatory variables. Established model selection procedures 
like the Dantzig selector or the Lasso can then be applied to estimate the 
nonzero coefficients of the augmented model. We then derive theoretical 
results providing bounds for the accuracy of the resulting estimators. 

The paper is organized as follows: in Section 2 we formalize our setup. 
We show in Section 3 that the traditional sparseness assumption is restric- 
tive and that a valid model may have to include principal components. The 
augmented model is thus introduced with an estimation procedure. Section 
4 deals with the problem of how accurately true principal components can be
estimated from the sample $X_1, \ldots, X_n$. Finite sample inequalities are given,
and we show that it is possible to obtain sensible estimates of those com- 
ponents which explain a considerable percentage of the total variance of 
all Xij, j = 1, . . . ,p. Section 5 focuses on theoretical properties of the aug- 
mented model, while in Section 6 we present simulation results illustrating 
the finite sample performance of our estimators. 

2. The setup. We study regression of a response variable $Y_i$ on a set of
i.i.d. predictors $X_i \in \mathbb{R}^p$, $i = 1, \ldots, n$, which adopt decomposition (1.2) with

$E(X_{ij}) = E(W_{ij}) = E(Z_{ij}) = 0$, $E(Z_{ij} Z_{ik}) = 0$, $E(W_{ij} Z_{il}) = 0$, $E(Z_{ij} Z_{ik} Z_{il} Z_{im}) = 0$

for all $j, k, l, m \in \{1, \ldots, p\}$, $j \notin \{k, l, m\}$. Throughout the following
sections we additionally assume that there exist constants $D_0, D_3 < \infty$ and
$0 < D_1 \le D_2 < \infty$ such that, with $\sigma_j^2 := \mathrm{Var}(Z_{ij})$, the following
assumption (A.1) is satisfied for all $p$:

(A.1)  $0 < D_1 \le \sigma_j^2 \le D_2$, $E(X_{ij}^2) \le D_0$, $E(Z_{ij}^4) \le D_3$ for all $j = 1, \ldots, p$.

Recall that $\Sigma = E(X_i X_i^T)$ is the covariance matrix of $X_i$ with $\Sigma = \Gamma + \Psi$,
where $\Gamma = E(W_i W_i^T)$ and $\Psi$ is a diagonal matrix with diagonal entries
$\sigma_j^2$, $j = 1, \ldots, p$. We denote as $\hat\Sigma = \frac{1}{n}\sum_{i=1}^n X_i X_i^T$ the empirical covariance
matrix based on the sample $X_i$, $i = 1, \ldots, n$.

Eigenvalues and eigenvectors of the standardized matrices $\frac{1}{p}\Gamma$ and $\frac{1}{p}\Sigma$ will
play a central role. We will use $\lambda_1 \ge \lambda_2 \ge \cdots$ and $\mu_1 \ge \mu_2 \ge \cdots$ to denote
the eigenvalues of $\frac{1}{p}\Gamma$ and $\frac{1}{p}\Sigma$, respectively, while $\psi_1, \psi_2, \ldots$ and $\delta_1, \delta_2, \ldots$
denote corresponding orthonormal eigenvectors. Note that all eigenvectors
of $\frac{1}{p}\Sigma$ and $\Sigma$ (or $\frac{1}{p}\Gamma$ and $\Gamma$) are identical, while eigenvalues differ by the
factor $1/p$. Standardization is important to establish convergence results for
large $p$, since the largest eigenvalues of $\Sigma$ tend to infinity as $p \to \infty$.

From a conceptual point of view we will concentrate on the case that $p$ is
large compared to $n$. Another crucial, qualitative assumption characterizing
our approach is the dimensionality reduction of $W_i$ using a small number
$k \ll p$ of eigenvectors (principal components) of $\frac{1}{p}\Gamma$ such that (in a good
approximation) $W_i \approx \sum_{r=1}^k (\psi_r^T W_i)\psi_r$. We also assume that
$D_X = \frac{1}{p}\sum_{j=1}^p E(X_{ij}^2) > D_W = \frac{1}{p}\sum_{j=1}^p E(W_{ij}^2) \gg \frac{1}{p}$.
Then all leading principal components of $\frac{1}{p}\Gamma$ corresponding to the $k$ largest
eigenvalues explain a considerable percentage of the total variance of $W_i$ and $X_i$.

Indeed, if $W_i = \sum_{r=1}^k (\psi_r^T W_i)\psi_r$, we necessarily have $\lambda_1 \ge D_W/k \gg 1/p$ and
$\mu_1 \ge \lambda_1 \ge D_W/k \gg 1/p$. Then $\mathrm{tr}(\frac{1}{p}\Gamma) = \sum_{r=1}^p \lambda_r = \frac{1}{p}\sum_{j=1}^p E(W_{ij}^2)$, and the first
principal component of $\frac{1}{p}\Gamma$ explains a considerable proportion
$\lambda_1 / \bigl(\frac{1}{p}\sum_{j=1}^p E(W_{ij}^2)\bigr) \ge \frac{1}{k} \gg \frac{1}{p}$ of the total variance of $W_i$.

We want to emphasize that this situation is very different from the setup
which is usually considered in the literature on the analysis of high-dimensional
covariance matrices; see, for example, Bickel and Levina (2008). It is then
assumed that the variables of interest are only weakly correlated and that
the largest eigenvalue $\mu_1$ of the corresponding scaled covariance matrix
$\frac{1}{p}\Sigma$ is of order $\frac{1}{p}$. This means that for large $p$ the first principal
component only explains a negligible percentage of the total variance of $X_i$,
$\mu_1 / \bigl(\frac{1}{p}\sum_{j=1}^p E(X_{ij}^2)\bigr) = O(\frac{1}{p})$. It is well known that in this case no consistent
estimates of eigenvalues and principal components can be obtained from an
eigen-decomposition of $\frac{1}{p}\hat\Sigma$.
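
The contrast between the two regimes can be made concrete with a small Python sketch. The two toy spectra below are assumptions chosen for illustration: a one-factor covariance $\Sigma = \mathbf{1}\mathbf{1}^T + I_p$, whose eigenvalues are $p + 1$ (once) and $1$ ($p - 1$ times), and an uncorrelated covariance $\Sigma = I_p$. The function `leading_share` is a hypothetical helper returning the proportion of total variance explained by the first principal component.

```python
def leading_share(eigenvalues):
    """Proportion of total variance explained by the first principal component."""
    return max(eigenvalues) / sum(eigenvalues)


def one_factor_spectrum(p):
    # Eigenvalues of (1/p) * (1 1^T + I_p): unit loadings plus unit specific variance.
    return [(p + 1) / p] + [1 / p] * (p - 1)


def uncorrelated_spectrum(p):
    # Eigenvalues of (1/p) * I_p: uncorrelated variables with unit variance.
    return [1 / p] * p


for p in (10, 100, 1000):
    print(p, leading_share(one_factor_spectrum(p)), leading_share(uncorrelated_spectrum(p)))
```

In the factor case the share stays near one half for every $p$, whereas in the uncorrelated case it equals $1/p$ and vanishes as $p$ grows.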

However, we will show in Section 4 that principal components which are 
able to explain a considerable proportion of total variance can be estimated 
consistently. These components will be an intrinsic part of the augmented 
model presented in Section 3. 

We will need a further assumption which ensures that all covariances 
between the different variables are well approximated by their empirical 
counterparts: 



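(A.2) There exists a $C_0 < \infty$ such that

(2.1)  $\sup_{1 \le j,l \le p} \Bigl| \frac{1}{n}\sum_{i=1}^n W_{ij} W_{il} - \mathrm{cov}(W_{ij}, W_{il}) \Bigr| \le C_0 \sqrt{\frac{\log p}{n}},$

(2.2)  $\sup_{1 \le j,l \le p} \Bigl| \frac{1}{n}\sum_{i=1}^n Z_{ij} Z_{il} - \mathrm{cov}(Z_{ij}, Z_{il}) \Bigr| \le C_0 \sqrt{\frac{\log p}{n}},$

(2.3)  $\sup_{1 \le j,l \le p} \Bigl| \frac{1}{n}\sum_{i=1}^n Z_{ij} W_{il} \Bigr| \le C_0 \sqrt{\frac{\log p}{n}},$

(2.4)  $\sup_{1 \le j,l \le p} \Bigl| \frac{1}{n}\sum_{i=1}^n X_{ij} X_{il} - \mathrm{cov}(X_{ij}, X_{il}) \Bigr| \le C_0 \sqrt{\frac{\log p}{n}}$

hold simultaneously with probability $A(n,p) > 0$, where $A(n,p) \to 1$ as
$n, p \to \infty$, $\frac{\log p}{n} \to 0$.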

The following proposition provides a general sufficient condition on
random vectors for which (A.2) is satisfied provided that the rate of convergence
of $\frac{\log p}{n}$ to 0 is sufficiently fast.

Proposition 1. Consider independent and identically distributed
random vectors $V_i \in \mathbb{R}^p$, $i = 1, \ldots, n$, such that for $j = 1, \ldots, p$, $E(V_{ij}) = 0$ and

(2.5)  $E(e^{a|V_{ij}|}) \le C_1$

for positive constants $a$ and $C_1$ with moreover $E(V_{ij}^4) \le C_1$. Then, for any



positive constant Co such that cl^^ < ^^gf and Ci < |C7oe'^V^on/(iogp)^k^ 

/ n 

P sup -\2Vi,Vu-coY{V^j,Vii) <C 



(2.6) 



logp 



Note that as $n, p \to \infty$ the right-hand side of (2.6) converges to 1
provided that $C_0$ is chosen sufficiently large and that $p/e^{n^{1-\tau}} = O(1)$ for some
$4/5 < \tau < 1$. Therefore, assumption (A.2) is satisfied if the components of
the random variables $X_{ij}$ possess some exponential moments. For the
specific case of centered normally distributed random variables, a more precise
bound in (2.6) may be obtained using Lemma 2.5 in Zhou, van de Geer
and Bühlmann (2009) and large deviations inequalities obtained by Zhou,
Lafferty and Wasserman (2008). In this case it may also be shown that for
sufficiently large $C_0$ events (2.1)-(2.4) hold with probability tending to 1 as
$p \to \infty$ without any restriction on the quotient $\log p/n$. Of course, the rate
$\sqrt{\log p / n}$ in (2.1)-(2.4) depends on the tails of the distributions: it would be
possible to replace this rate with a slower one in case of heavier tails than
in Proposition 1. Our theoretical results could be modified accordingly.



3. The augmented model. Let us now consider the structural model (1.2)
more closely. It implies that the vector $X_i$ of predictors can be decomposed
into two uncorrelated random vectors $W_i$ and $Z_i$. Each of these two
components separately may possess a significant influence on the response variable
$Y_i$. Indeed, if $W_i$ and $Z_i$ were known, a possibly substantial improvement
of model (1.1) would consist in a regression of $Y_i$ on the $2p$ variables $W_i$
and $Z_i$,

(3.1)  $Y_i = \sum_{j=1}^p \beta_j^* W_{ij} + \sum_{j=1}^p \beta_j Z_{ij} + \varepsilon_i, \quad i = 1, \ldots, n,$

with different sets of parameters $\beta_j^*$ and $\beta_j$, $j = 1, \ldots, p$, for each
contributor. We here again assume that $\varepsilon_i$, $i = 1, \ldots, n$, are centered i.i.d. random
variables with $\mathrm{Var}(\varepsilon_i) = \sigma^2$ which are independent of $W_{ij}$ and $Z_{ij}$.

By definition, $W_{ij}$ and $Z_{ij}$ possess substantially different interpretations.
$Z_{ij}$ describes the part of $X_{ij}$ which is uncorrelated with all other variables. A
nonzero coefficient $\beta_j \neq 0$ then means that the variation of $X_{ij}$ has a specific
effect on $Y_i$. We will of course assume that such nonzero coefficients are
sparse, $\#\{j \mid \beta_j \neq 0\} \le S$ for some $S \ll p$. The true variables $Z_{ij}$ are unknown,
but with $\beta_j^{**} = \beta_j^* - \beta_j$ model (3.1) can obviously be rewritten in the form

(3.2)  $Y_i = \sum_{j=1}^p \beta_j^{**} W_{ij} + \sum_{j=1}^p \beta_j X_{ij} + \varepsilon_i, \quad i = 1, \ldots, n.$
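
The step from (3.1) to (3.2) only uses $X_{ij} = W_{ij} + Z_{ij}$ and $\beta_j^{**} = \beta_j^* - \beta_j$. A quick numerical check in Python, with hypothetical values for a single observation and $p = 4$ (all numbers below are made up for illustration), may make the bookkeeping explicit:

```python
def predictor_31(w, z, beta_star, beta):
    """Systematic part of (3.1): sum_j beta*_j W_ij + sum_j beta_j Z_ij."""
    return (sum(bs * wj for bs, wj in zip(beta_star, w))
            + sum(b * zj for b, zj in zip(beta, z)))


def predictor_32(w, z, beta_star, beta):
    """Systematic part of (3.2), with X_ij = W_ij + Z_ij and beta**_j = beta*_j - beta_j."""
    x = [wj + zj for wj, zj in zip(w, z)]
    beta_2star = [bs - b for bs, b in zip(beta_star, beta)]
    return (sum(b2 * wj for b2, wj in zip(beta_2star, w))
            + sum(b * xj for b, xj in zip(beta, x)))


# Hypothetical values for one observation (illustration only).
w = [0.8, 0.9, 1.1, 1.0]           # common component W_i
z = [0.3, -0.2, 0.0, 0.5]          # specific component Z_i
beta_star = [0.5, 0.5, 0.5, 0.5]   # coefficients of W_i in (3.1)
beta = [0.0, 2.0, 0.0, 0.0]        # sparse specific effects

print(predictor_31(w, z, beta_star, beta), predictor_32(w, z, beta_star, beta))
```

Both forms give the same systematic part, so (3.2) expresses the model through the observable $X_{ij}$ and moves the unobserved common component into the first sum.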

The variables $W_{ij}$ are heavily correlated. It therefore does not make any
sense to assume that for some $j \in \{1, \ldots, p\}$ any particular variable $W_{ij}$
possesses a specific influence on the response variable. However, the term
$\sum_{j=1}^p \beta_j^{**} W_{ij}$ may represent an important, common effect of all predictor
variables. The vectors $W_i$ can obviously be rewritten in terms of principal
components. Let us recall that $\lambda_1 \ge \lambda_2 \ge \cdots$ denote the eigenvalues of the
standardized covariance matrix of $W_i$, $\frac{1}{p}\Gamma = \frac{1}{p}E(W_i W_i^T)$, and $\psi_1, \psi_2, \ldots$
corresponding orthonormal eigenvectors. We have

$W_i = \sum_{r=1}^p (\psi_r^T W_i)\psi_r$  and  $\sum_{j=1}^p \beta_j^{**} W_{ij} = \sum_{r=1}^p \alpha_r^* (\psi_r^T W_i),$

where $\alpha_r^* = \sum_{j=1}^p \beta_j^{**} \psi_{rj}$. As outlined in the previous sections we now assume
that the use of principal components allows for a considerable reduction of
dimensionality, and that a small number of leading principal components
will suffice to describe the effects of the variable $W_i$. This may be seen as
an analogue of the sparseness assumption made for the $Z_{ij}$. More precisely,
subsequent analysis will be based on the assumption that the following
augmented model holds for some suitable $k \ge 1$:

(3.3)  $Y_i = \sum_{r=1}^k \alpha_r \xi_{ir} + \sum_{j=1}^p \beta_j X_{ij} + \varepsilon_i,$

where $\xi_{ir} = \psi_r^T W_i / \sqrt{p\lambda_r}$ and $\alpha_r = \sqrt{p\lambda_r}\,\alpha_r^*$. The use of $\xi_{ir}$ instead of $\psi_r^T W_i$
is motivated by the fact that $\mathrm{Var}(\psi_r^T W_i) = p\lambda_r$, $r = 1, \ldots, k$. Therefore the
$\xi_{ir}$ are standardized variables with $\mathrm{Var}(\xi_{i1}) = \cdots = \mathrm{Var}(\xi_{ik}) = 1$. Fitting an
augmented model requires us to select an appropriate $k$ as well as to
determine sensible estimates of $\xi_{i1}, \ldots, \xi_{ik}$. Furthermore, model selection
procedures like Lasso or Dantzig have to be applied in order to retrieve the
nonzero coefficients $\alpha_r$, $r = 1, \ldots, k$, and $\beta_j$, $j = 1, \ldots, p$. These issues will
be addressed in subsequent sections.

Obviously, the augmented model may be considered as a synthesis of the 
standard type of models proposed in the literature on functional regression 
and model selection. It generalizes the classical multivariate linear regression 
model (1.1). If a $k$-factor model holds exactly, that is, $\mathrm{rank}(\Gamma) = k$, then the
only substantial restriction of (3.1)-(3.3) consists in the assumption that $Y_i$
depends linearly on $W_i$ and $Z_i$.

We want to emphasize, however, that our analysis does not require the
validity of a $k$-factor model. It is only assumed that there exist "some" $Z_i$
and $W_i$ satisfying our assumptions which lead to (3.3) for a sparse set of
coefficients $\beta_j$.

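3.1. Identifiability. Let $\beta = (\beta_1, \ldots, \beta_p)^T$ and $\alpha = (\alpha_1, \ldots, \alpha_k)^T$. Since
$\psi_r$, $r = 1, \ldots, k$, are eigenvectors of $\Gamma$ we have $E(\psi_r^T W_i\, \psi_s^T W_i) = 0$ for all
$r, s = 1, \ldots, p$, $r \neq s$. By assumption the random vectors $W_i$ and $Z_i$ are
uncorrelated, and hence $E(\psi_r^T W_i Z_{ij}) = 0$ for all $r, j = 1, \ldots, p$. Furthermore,
$E(Z_{il} Z_{ij}) = 0$ for all $l \neq j$. If the augmented model (3.3) holds, some
straightforward computations then show that under (A.1) for any alternative set of
coefficients $\beta^* = (\beta_1^*, \ldots, \beta_p^*)^T$, $\alpha^* = (\alpha_1^*, \ldots, \alpha_k^*)^T$,

(3.4)  $E\Bigl(\Bigl[\sum_{r=1}^k (\alpha_r - \alpha_r^*)\xi_{ir} + \sum_{j=1}^p (\beta_j - \beta_j^*) X_{ij}\Bigr]^2\Bigr) \ge \sum_{r=1}^k \Bigl(\alpha_r - \alpha_r^* + \sqrt{p\lambda_r}\sum_{j=1}^p \psi_{rj}(\beta_j - \beta_j^*)\Bigr)^2 + D_1 \|\beta - \beta^*\|_2^2.$

We can conclude that the coefficients $\alpha_r$, $r = 1, \ldots, k$, and $\beta_j$, $j = 1, \ldots, p$, in
(3.3) are uniquely determined.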

Of course, an inherent difficulty of (3.3) consists of the fact that it
contains the unobserved, "latent" variables $\xi_{ir} = \psi_r^T W_i / \sqrt{p\lambda_r}$. To study this
problem, first recall that our setup imposes the decomposition (1.3) of the
covariance matrix $\Sigma$ of $X_i$. If a factor model with $k$ factors holds exactly,
then $\Gamma$ possesses rank $k$. It then follows from well-established results
in multivariate analysis that if $k < p/2$ the matrices $\Gamma$ and $\Psi$ are uniquely
identified. If $\lambda_1 > \lambda_2 > \cdots > \lambda_k > 0$, then also $\psi_1, \ldots, \psi_k$ are uniquely
determined (up to sign) from the structure of $\Gamma$.

However, for large $p$, identification is possible under even more general
conditions. It is not necessary that a $k$-factor model holds exactly. We only
need an additional assumption on the magnitude of the eigenvalues of $\frac{1}{p}\Gamma$
defining the $k$ principal components of $W_i$ to be considered.

(A.3) The eigenvalues of $\frac{1}{p}\Gamma$ are such that

$\min_{j,l \le k,\, j \neq l} |\lambda_j - \lambda_l| \ge v(k), \qquad \min_{j \le k} \lambda_j \ge v(k)$

for some $1 \ge v(k) > 0$ with $p\,v(k) > 6D_2$.

In the following we will qualitatively assume that $k \ll p$ as well as
$v(k) \gg 1/p$. More specific assumptions will be made in the sequel. Note that
eigenvectors are only unique up to sign changes. In the following we will always
assume that the right "versions" are used. This will go without saying.



Theorem 1. Let $\xi_{ir}^* := \delta_r^T X_i / \sqrt{p\mu_r}$ and $P_k = I_p - \sum_{r=1}^k \psi_r \psi_r^T$. Under
assumptions (A.1) and (A.2) we have for all $r = 1, \ldots, k$, $j = 1, \ldots, p$ and all $k, p$
satisfying (A.3):



(3.5) 



(3.6) 



(3.7) 



(3.8) E 



D2 

Xr \ < ■ 



'r\\2 



< 



m^r-CV)<— + 



p 

2D2 

pv{k) ' 
pv{k) ' 

D2 , (8Ai + l)L>i 



E 



.r=l 



r=l 




PHr p^v{kY^r 



2(Ih^i8Xi + l)Di 



\Pfir p'^v{kY^r 



For small $p$, standard factor analysis uses special algorithms in order to
identify $\psi_r$. The theorem tells us that for large $p$ this is unnecessary since
then the eigenvectors $\delta_r$ of $\frac{1}{p}\Sigma$ provide a good approximation. The predictor
$\xi_{ir}^*$ of $\xi_{ir}$ possesses an error of order $1/\sqrt{p}$. The error decreases as $p$ increases,
and $\xi_{ir}^*$ thus yields a good approximation of $\xi_{ir}$ if $p$ is large. Indeed, if $p \to \infty$
[for fixed $\mu_r$, $v(k)$, $D_1$ and $D_2$] then by (3.7) we have $E([\xi_{ir} - \xi_{ir}^*]^2) \to 0$.
Furthermore, by (3.8) the error in predicting $\sum_{r=1}^k \alpha_r \xi_{ir} + \sum_{j=1}^p \beta_j X_{ij}$ by
$\sum_{r=1}^k \alpha_r \xi_{ir}^* + \sum_{j=1}^p \beta_j X_{ij}$ converges to zero as $p \to \infty$.
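
The approximation of $\psi_r$ by $\delta_r$ can be observed in a toy population model. The Python sketch below is an illustration under assumptions made here, not the paper's setting: a one-factor model with $\Gamma = \mathbf{1}\mathbf{1}^T$ (so $\lambda_1 = 1$ and $\psi_1 = \mathbf{1}/\sqrt{p}$) and $\Psi$ with alternating diagonal entries 0.5 and 1.5. The leading eigenpair of $\frac{1}{p}\Sigma$ is computed by power iteration, a generic numerical device. By Weyl's inequality, $|\mu_1 - \lambda_1|$ cannot exceed $\max_j \sigma_j^2 / p$.

```python
import math


def leading_eigenpair(matrix, iterations=50):
    """Power iteration: leading eigenvalue (Rayleigh quotient) and unit eigenvector."""
    size = len(matrix)
    vec = [1.0 / math.sqrt(size)] * size
    for _ in range(iterations):
        new = [sum(row[j] * vec[j] for j in range(size)) for row in matrix]
        norm = math.sqrt(sum(v * v for v in new))
        vec = [v / norm for v in new]
    value = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(size))
                for i in range(size))
    return value, vec


def compare_population_eigen(p):
    """Return (|mu_1 - lambda_1|, ||psi_1 - delta_1||_2) for the toy one-factor model."""
    psi_diag = [0.5 if j % 2 == 0 else 1.5 for j in range(p)]  # specific variances
    # Standardized covariance (1/p) * (1 1^T + Psi).
    sigma_std = [[(1.0 + (psi_diag[j] if j == l else 0.0)) / p for l in range(p)]
                 for j in range(p)]
    mu_1, delta_1 = leading_eigenpair(sigma_std)
    lambda_1 = 1.0                       # leading eigenvalue of (1/p) * 1 1^T
    psi_1 = [1.0 / math.sqrt(p)] * p     # corresponding eigenvector
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(psi_1, delta_1)))
    return abs(mu_1 - lambda_1), dist


for p in (20, 100):
    gap, dist = compare_population_eigen(p)
    print(p, gap, dist)
```

In this example both the eigenvalue gap and the eigenvector distance shrink roughly like $1/p$, in line with the qualitative statement that the approximation improves as $p$ increases.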

A crucial prerequisite for a reasonable analysis of the model is sparseness
of the coefficients $\beta_j$. Note that if $p$ is large compared to $n$, then by (3.8) the
error in replacing $\xi_{ir}$ by $\xi_{ir}^*$ is negligible compared to the estimation error
induced by the existence of the error terms $\varepsilon_i$. If $k \ll p$ and $\#\{j \mid \beta_j \neq 0\} \ll p$,
then the true coefficients $\alpha_r$ and $\beta_j$ provide a sparse solution of the regression
problem.

Established theoretical results [see Bickel, Ritov and Tsybakov (2009)]
show that under some regularity conditions (validity of the "restricted
eigenvalue conditions") model selection procedures allow one to identify such sparse
solutions even if there are multiple vectors of coefficients satisfying the
normal equations. The latter is of course always the case if $p > n$. Indeed, we
will show in the following sections that factors can be consistently estimated
from the data, and that a suitable application of Lasso or the Dantzig
selector leads to consistent estimators $\hat\alpha_r$, $\hat\beta_j$ satisfying
$\sup_r |\alpha_r - \hat\alpha_r| \to_P 0$, $\sup_j |\beta_j - \hat\beta_j| \to_P 0$, as $n, p \to \infty$.

When replacing $\xi_{ir}$ by $\xi_{ir}^*$, there are alternative sets of coefficients
leading to the same prediction error as in (3.8). This is due to the fact that
$\xi_{ir}^* = \sum_{j=1}^p \delta_{rj} X_{ij} / \sqrt{p\mu_r}$. However, all these alternative solutions are nonsparse and
cannot be identified by Lasso or other procedures. In particular, it is easily
seen that

(3.9)  $\sum_{r=1}^k \alpha_r \xi_{ir}^* + \sum_{j=1}^p \beta_j X_{ij} = \sum_{j=1}^p \beta_j^{LR} X_{ij}$  with  $\beta_j^{LR} := \beta_j + \sum_{r=1}^k \alpha_r \frac{\delta_{rj}}{\sqrt{p\mu_r}}.$

By (3.6) all values $\delta_{rj}^2$ are of order $1/(p\,v(k))$. Since $\sum_{j=1}^p \delta_{rj}^2 = 1$, this implies
that many $\delta_{rj}$ are nonzero. Therefore, if $\alpha_r \neq 0$ for some $r \in \{1, \ldots, k\}$, then
$\{j \mid \beta_j^{LR} \neq 0\}$ contains a large number of small, nonzero coefficients and is
not at all sparse. If $p$ is large compared to $n$ no known estimation procedure
will be able to provide consistent estimates of these coefficients.
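
The loss of sparsity in (3.9) is easy to reproduce. The Python sketch below uses a hypothetical configuration chosen only for illustration ($p = 10$, $k = 1$, $\delta_1$ proportional to the vector of ones, $\mu_1 = 1$, one nonzero $\beta_j$); `beta_lr` is a helper name introduced here. It computes $\beta_j^{LR} = \beta_j + \sum_r \alpha_r \delta_{rj}/\sqrt{p\mu_r}$ and checks that the augmented and the standard linear predictors coincide for an arbitrary predictor vector.

```python
import math


def beta_lr(alpha, beta, deltas, mus):
    """beta^LR_j = beta_j + sum_r alpha_r * delta_rj / sqrt(p * mu_r), cf. (3.9)."""
    p = len(beta)
    return [beta[j] + sum(a * d[j] / math.sqrt(p * m)
                          for a, d, m in zip(alpha, deltas, mus))
            for j in range(p)]


# Hypothetical toy configuration (illustration only).
p = 10
deltas = [[1.0 / math.sqrt(p)] * p]   # delta_1, a unit vector
mus = [1.0]                           # mu_1
alpha = [2.0]                         # effect of the common component
beta = [0.0] * p
beta[3] = 1.5                         # a single specific effect, S = 1

b_lr = beta_lr(alpha, beta, deltas, mus)
x_i = [0.1 * (j + 1) for j in range(p)]   # an arbitrary predictor vector

xi_star = sum(d * x for d, x in zip(deltas[0], x_i)) / math.sqrt(p * mus[0])
augmented = alpha[0] * xi_star + sum(b * x for b, x in zip(beta, x_i))
standard = sum(b * x for b, x in zip(b_lr, x_i))

nonzero_beta = sum(1 for b in beta if b != 0.0)
nonzero_lr = sum(1 for b in b_lr if b != 0.0)
print(nonzero_beta, nonzero_lr)
print(augmented, standard)
```

The augmented representation needs two nonzero coefficients here ($\alpha_1$ and one $\beta_j$), while the equivalent standard representation has all ten coefficients nonzero.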

Summarizing the above discussion we can conclude: 

(1) If the variables $X_{ij}$ are heavily correlated and follow an approximate
factor model, then one may reasonably expect substantial effects of the
common, joint variation of all variables and, consequently, nonzero coefficients
$\beta_j^*$ and $\alpha_r$ in (3.1) and (3.3). But then a "bet on sparsity" is unjustifiable
when dealing with the standard regression model (1.1). It follows from (3.9)
that for large $p$ model (1.1) holds approximately for a nonsparse set of
coefficients $\beta_j^{LR}$, since many small, nonzero coefficients are necessary in order
to capture the effects of the common joint variation.

(2) The augmented model offers a remedy to this problem by pooling 
possible effects of the joint variation using a small number of additional 
variables. Together with the familiar assumption of a small number of vari- 
ables possessing a specific influence, this leads to a sparse model with at 
most k + S nonzero coefficients which can be recovered from model selection 
procedures like Lasso or the Dantzig-selector. 

(3) In practice, even if (3.3) only holds approximately, since a too-small 
value of k has been selected, it may be able to quantify at least some im- 
portant part of the effects discussed above. Compared to an analysis based 
on a standard model (1.1), this may lead to a substantial improvement of 
model fit as well as to more reliable interpretations of significant variables. 

3.2. Estimation. For a pre-specified $k \ge 1$ we now define a procedure
for estimating the components of the corresponding augmented model (3.3)
from given data. This requires suitable procedures for approximating
the unknown values $\xi_{ir}$ as well as the application of subsequent model selection
procedures in order to retrieve nonzero coefficients $\alpha_r$ and $\beta_j$, $r = 1, \ldots, k$,
$j = 1, \ldots, p$. A discussion of the choice of $k$ can be found in the next section.

Recall from Theorem 1 that for large $p$ the eigenvectors $\psi_1, \ldots, \psi_k$ of $\frac{1}{p}\Gamma$
are well approximated by the eigenvectors of the standardized covariance
matrix $\frac{1}{p}\Sigma$. This motivates us to use the empirical principal components of
$X_1, \ldots, X_n$ in order to determine estimates of $\psi_r$ and $\xi_{ir}$. Theoretical
support will be given in the next section. Define $\hat\lambda_1 \ge \hat\lambda_2 \ge \cdots$ as the eigenvalues
of the standardized empirical covariance matrix $\frac{1}{p}\hat\Sigma = \frac{1}{np}\sum_{i=1}^n X_i X_i^T$, while
$\hat\psi_1, \hat\psi_2, \ldots$ are associated orthonormal eigenvectors. We then estimate $\xi_{ir}$ by

$\hat\xi_{ir} = \hat\psi_r^T X_i / \sqrt{p\hat\lambda_r}, \quad r = 1, \ldots, k,\ i = 1, \ldots, n.$

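When replacing $\xi_{ir}$ by $\hat\xi_{ir}$ in (3.3), a direct application of model
selection procedures does not seem to be adequate, since $\hat\xi_{ir}$ and the predictor
variables $X_{ij}$ are heavily correlated. We therefore rely on a projected model.
Consider the projection matrix on the orthogonal space of the space spanned
by the eigenvectors corresponding to the $k$ largest eigenvalues of $\frac{1}{p}\hat\Sigma$,

$\hat P_k = I_p - \sum_{r=1}^k \hat\psi_r \hat\psi_r^T.$

Then model (3.3) can be rewritten for $i = 1, \ldots, n$,

(3.10)  $Y_i = \sum_{r=1}^k \tilde\alpha_r \hat\xi_{ir} + \sum_{j=1}^p \tilde\beta_j \frac{(\hat P_k X_i)_j}{\bigl(\frac{1}{n}\sum_{l=1}^n (\hat P_k X_l)_j^2\bigr)^{1/2}} + \tilde\varepsilon_i + \varepsilon_i,$

where $\tilde\alpha_r = \alpha_r + \sqrt{p\hat\lambda_r}\sum_{j=1}^p \hat\psi_{rj}\beta_j$, $\tilde\beta_j = \beta_j \bigl(\frac{1}{n}\sum_{l=1}^n (\hat P_k X_l)_j^2\bigr)^{1/2}$ and
$\tilde\varepsilon_i = \sum_{r=1}^k \alpha_r(\xi_{ir} - \hat\xi_{ir})$. It will be shown in the next section that for large $n$ and
$p$ the additional error term $\tilde\varepsilon_i$ can be assumed to be reasonably small.

In the following we will use $\tilde X_i$ to denote the vectors with entries
$\tilde X_{ij} := (\hat P_k X_i)_j / \bigl(\frac{1}{n}\sum_{l=1}^n (\hat P_k X_l)_j^2\bigr)^{1/2}$. Furthermore, consider the
$(k + p)$-dimensional vector of predictors $\Phi_i := (\hat\xi_{i1}, \ldots, \hat\xi_{ik}, \tilde X_{i1}, \ldots, \tilde X_{ip})^T$.
The Gram matrix in model (3.10) is a block matrix defined as

$\frac{1}{n}\sum_{i=1}^n \Phi_i \Phi_i^T = \begin{pmatrix} I_k & 0 \\ 0 & \frac{1}{n}\sum_{i=1}^n \tilde X_i \tilde X_i^T \end{pmatrix},$

where $I_k$ is the identity matrix of size $k$. Note that the normalization of the
predictors in (3.10) implies that the diagonal elements of the Gram matrix
above are equal to 1.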
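
The construction of the projected predictors can be sketched in a few lines of plain Python for $k = 1$. Everything in this block is an assumption made for illustration: the simulated one-factor data $X_{ij} = f_i + Z_{ij}$, the sample sizes, and the use of power iteration for the leading eigenpair. The sketch computes the factor scores $\hat\xi_{i1} = \hat\psi_1^T X_i / \sqrt{p\hat\lambda_1}$, projects $X_i$ onto the orthogonal complement of $\hat\psi_1$, normalizes the columns, and then checks the block structure of the Gram matrix numerically.

```python
import math
import random


def leading_eigenpair(matrix, iterations=300):
    """Power iteration: leading eigenvalue (Rayleigh quotient) and unit eigenvector."""
    size = len(matrix)
    vec = [1.0 / math.sqrt(size)] * size
    for _ in range(iterations):
        new = [sum(row[j] * vec[j] for j in range(size)) for row in matrix]
        norm = math.sqrt(sum(v * v for v in new))
        vec = [v / norm for v in new]
    value = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(size))
                for i in range(size))
    return value, vec


random.seed(0)
n, p = 40, 8
# Hypothetical one-factor data X_ij = f_i + Z_ij (an assumption of this sketch).
X = []
for _ in range(n):
    factor = random.gauss(0.0, 1.0)
    X.append([factor + random.gauss(0.0, 0.5) for _ in range(p)])

# Standardized empirical covariance matrix (1/(n p)) * sum_i X_i X_i^T.
sigma_std = [[sum(X[i][j] * X[i][l] for i in range(n)) / (n * p) for l in range(p)]
             for j in range(p)]
lam_hat, psi_hat = leading_eigenpair(sigma_std)

scores = [sum(psi_hat[j] * X[i][j] for j in range(p)) for i in range(n)]
xi_hat = [s / math.sqrt(p * lam_hat) for s in scores]

# Projection on the orthogonal complement of psi_hat, then column normalization.
projected = [[X[i][j] - scores[i] * psi_hat[j] for j in range(p)] for i in range(n)]
scale = [math.sqrt(sum(projected[i][j] ** 2 for i in range(n)) / n) for j in range(p)]
X_tilde = [[projected[i][j] / scale[j] for j in range(p)] for i in range(n)]

gram_xi = sum(v * v for v in xi_hat) / n
gram_cross = [sum(xi_hat[i] * X_tilde[i][j] for i in range(n)) / n for j in range(p)]
gram_diag = [sum(X_tilde[i][j] ** 2 for i in range(n)) / n for j in range(p)]
print(gram_xi, max(abs(c) for c in gram_cross), gram_diag)
```

The factor score has empirical second moment 1, its empirical cross-products with all projected predictors vanish up to rounding, and the projected predictors have unit diagonal entries, which is the block form stated above.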

Arguing now that the vector of parameters $\theta := (\tilde\alpha_1, \ldots, \tilde\alpha_k, \tilde\beta_1, \ldots, \tilde\beta_p)^T$
in model (3.10) is $(k + S)$-sparse, we may use a selection procedure to
recover/estimate the nonnull parameters. In the following we will concentrate
on the Lasso estimator introduced in Tibshirani (1996). For a pre-specified
parameter $\rho > 0$, an estimator is then obtained as

(3.11)  $\hat\theta = \arg\min_{\theta \in \mathbb{R}^{k+p}} \frac{1}{n}\|Y - \Phi\theta\|_2^2 + 2\rho\|\theta\|_1,$

$\Phi$ being the $n \times (k + p)$-dimensional matrix with rows $\Phi_i$. We can
alternatively use the Dantzig selector introduced in Candes and Tao (2007).
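
For readers who want to experiment with (3.11), the following Python sketch minimizes $\frac{1}{n}\|Y - \Phi\theta\|_2^2 + 2\rho\|\theta\|_1$ by cyclic coordinate descent with soft-thresholding. Coordinate descent is one standard way to compute a Lasso solution and is used here as an assumption; the paper does not prescribe a particular algorithm. The function names and the small orthogonal design are hypothetical.

```python
def soft_threshold(value, threshold):
    """Soft-thresholding operator used in the coordinate-wise Lasso update."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_cd(Phi, Y, rho, sweeps=200):
    """Minimize (1/n)*||Y - Phi theta||_2^2 + 2*rho*||theta||_1 by coordinate descent."""
    n, d = len(Phi), len(Phi[0])
    theta = [0.0] * d
    residual = list(Y)  # equals Y - Phi theta for theta = 0
    col_norm = [sum(Phi[i][j] ** 2 for i in range(n)) / n for j in range(d)]
    for _ in range(sweeps):
        for j in range(d):
            if col_norm[j] == 0.0:
                continue
            # Empirical covariance of column j with the partial residual.
            z = sum(Phi[i][j] * (residual[i] + Phi[i][j] * theta[j])
                    for i in range(n)) / n
            new = soft_threshold(z, rho) / col_norm[j]
            if new != theta[j]:
                shift = new - theta[j]
                for i in range(n):
                    residual[i] -= Phi[i][j] * shift
                theta[j] = new
    return theta


# Hypothetical orthogonal design with (1/n) * Phi^T Phi = I (illustration only).
Phi = [[1.0, 1.0, 1.0],
       [1.0, -1.0, 1.0],
       [1.0, 1.0, -1.0],
       [1.0, -1.0, -1.0]]
theta_true = [2.0, 0.0, -0.5]
Y = [sum(t * v for t, v in zip(theta_true, row)) for row in Phi]

theta_strong = lasso_cd(Phi, Y, rho=0.6)
theta_weak = lasso_cd(Phi, Y, rho=0.1)
print(theta_strong, theta_weak)
```

With this orthogonal, noise-free design each coordinate of the solution is the soft-thresholded true coefficient, so a larger $\rho$ removes the small coefficient entirely and shrinks the large one.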

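Finally, from $\hat\theta = (\hat{\tilde\alpha}_1, \ldots, \hat{\tilde\alpha}_k, \hat{\tilde\beta}_1, \ldots, \hat{\tilde\beta}_p)^T$, we define corresponding
estimators for $\alpha_r$, $r = 1, \ldots, k$, and $\beta_j$, $j = 1, \ldots, p$, in the unprojected model (3.3):

$\hat\beta_j = \frac{\hat{\tilde\beta}_j}{\bigl(\frac{1}{n}\sum_{i=1}^n (\hat P_k X_i)_j^2\bigr)^{1/2}}, \quad j = 1, \ldots, p,$

and

$\hat\alpha_r = \hat{\tilde\alpha}_r - \sqrt{p\hat\lambda_r}\sum_{j=1}^p \hat\psi_{rj}\hat\beta_j, \quad r = 1, \ldots, k.$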
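
The passage between the coefficients of (3.3) and those of the projected model (3.10) is a simple invertible transformation. The Python sketch below implements both directions and checks the round trip. The function names and all numerical inputs ($\hat\psi_1$, $\hat\lambda_1$, the column scales) are hypothetical values chosen for illustration.

```python
import math


def to_projected(alpha, beta, psi_hat, lam_hat, scale):
    """Map (alpha, beta) of (3.3) to (alpha_tilde, beta_tilde) of the projected model."""
    p = len(beta)
    alpha_t = [a + math.sqrt(p * lam) * sum(psi[j] * beta[j] for j in range(p))
               for a, psi, lam in zip(alpha, psi_hat, lam_hat)]
    beta_t = [b * s for b, s in zip(beta, scale)]
    return alpha_t, beta_t


def to_unprojected(alpha_t, beta_t, psi_hat, lam_hat, scale):
    """Invert the map: recover (alpha, beta) from the projected coefficients."""
    p = len(beta_t)
    beta = [bt / s for bt, s in zip(beta_t, scale)]
    alpha = [at - math.sqrt(p * lam) * sum(psi[j] * beta[j] for j in range(p))
             for at, psi, lam in zip(alpha_t, psi_hat, lam_hat)]
    return alpha, beta


# Hypothetical inputs for p = 3 and k = 1 (illustration only).
psi_hat = [[0.6, 0.8, 0.0]]   # estimated eigenvector, unit length
lam_hat = [0.75]              # estimated eigenvalue, so sqrt(p * lam) = 1.5
scale = [0.5, 2.0, 1.0]       # column scales ((1/n) sum_i (P_k X_i)_j^2)^(1/2)
alpha = [1.0]
beta = [0.0, 3.0, 0.0]

alpha_t, beta_t = to_projected(alpha, beta, psi_hat, lam_hat, scale)
alpha_back, beta_back = to_unprojected(alpha_t, beta_t, psi_hat, lam_hat, scale)
print(alpha_t, beta_t)
print(alpha_back, beta_back)
```

Note that sparsity of $\beta$ carries over to $\tilde\beta$ because the rescaling is coordinate-wise, while $\tilde\alpha_r$ absorbs the part of $\beta$ lying in the direction of $\hat\psi_r$.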



4. High-dimensional factor analysis: Theoretical results. The following 
theorem shows that principal components which are able to explain a con- 
siderable proportion of total variance can be estimated consistently. 

For simplicity, we will concentrate on the case that $n$ as well as $p > \sqrt{n}$
are large enough such that

(A.4)  $C_0(\log p/n)^{1/2} \ge \frac{D_0}{p}$ and $v(k) \ge 6\bigl(D_2/p + C_0(\log p/n)^{1/2}\bigr)$.

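Theorem 2. Under assumptions (A.1)-(A.4) and under events (2.1)-(2.4)
we have for all $r = 1, \ldots, k$ and all $j = 1, \ldots, p$,

(4.1)  $|\lambda_r - \hat\lambda_r| \le \frac{D_2}{p} + C_0(\log p/n)^{1/2},$

(4.2)  $\|\psi_r - \hat\psi_r\|_2 \le 2\,\frac{D_2/p + C_0(\log p/n)^{1/2}}{v(k)},$

(4.3)  $\psi_{rj}^2 \le \frac{D_0 - D_1}{p\lambda_r} \le \frac{D_0 - D_1}{p\,v(k)},$

(4.4)  $\hat\psi_{rj}^2 \le \frac{D_0 + C_0(\log p/n)^{1/2}}{p\hat\lambda_r} \le \frac{6}{5}\,\frac{D_0 + C_0(\log p/n)^{1/2}}{p\,v(k)}.$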

Theorem 2 shows that for sufficiently large $p$ ($p \ge \sqrt{n}$) the eigenvalues and
eigenvectors of $\frac1p\hat\Sigma$ provide reasonable estimates of $\lambda_r$ and $\psi_r$ for $r = 1, \dots, k$.
Quite obviously it is not possible to determine sensible estimates of all $p$
principal components of $\frac1p\Sigma$. Following the proposition it is required that
$\lambda_r$ as well as $\mu_r$ be of order at least $\sqrt{\log p/n}$. Any smaller component cannot
be distinguished from pure "noise" components. Up to the $\log p$-term this
corresponds to the results of Hall and Hosseini-Nasab (2006) who study the
problem of the number of principal components that can be consistently
estimated in a functional principal component analysis.

The above insights are helpful for selecting an appropriate $k$ in a real
data application. Typically, a suitable factor model will incorporate $k$
components which explain a large percentage of the total variance of $X_i$,
while $\lambda_{k+1}$ is very small. If for a sample of high-dimensional vectors $X_i$
a principal component analysis leads to the conclusion that the first (or
second, third, ...) principal component explains a large percentage of the
total (empirical) variance of the observations, then such a component cannot
be generated by "noise" but reflects an underlying structure. In particular,
such a component may play a crucial role in modeling a response variable
$Y_i$ according to an augmented regression model of the form (3.3).

Bai and Ng (2002) develop criteria for selecting the dimension $k$ in a
high-dimensional factor model. They rely on an adaptation of the well-known AIC
and BIC procedures in model selection. One possible approach is as follows:
Select a maximal possible dimension $k_{\max}$ and estimate $\frac1p\sum_{j=1}^p\sigma_j^2$ by
$\hat\sigma^2 = \frac{1}{np}\sum_{i=1}^n\sum_{j=1}^p\big(X_{ij} - \sum_{r=1}^{k_{\max}}(\hat\psi_r^TX_i)\hat\psi_{rj}\big)^2$. Then determine an estimate
$\hat k$ by minimizing

(4.5) $\frac{1}{np}\sum_{i=1}^n\sum_{j=1}^p\Big(X_{ij} - \sum_{r=1}^{\kappa}(\hat\psi_r^TX_i)\hat\psi_{rj}\Big)^2 + \kappa\hat\sigma^2\Big(\frac{n+p}{np}\Big)\log\min\{n,p\}$

over $\kappa = 1, \dots, k_{\max}$. Bai and Ng (2002) show that under some regularity
conditions this criterion (as well as a number of alternative versions)
provides asymptotically consistent estimates of the true factor dimension $k$ as
$n, p \to \infty$. In our context these regularity conditions are satisfied if (A.1)-
(A.4) hold for all n and p, sup j^pK^Zfj) < oo and if there exists some i?o > 
such that Xk> Bq> 0, for all n,p. 
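The selection rule (4.5) is straightforward to compute from a single eigendecomposition. The sketch below is an illustration under stated assumptions, not the authors' implementation. It reads (4.5) as the residual variance after removing $\kappa$ principal components plus the penalty $\kappa\hat\sigma^2\frac{n+p}{np}\log\min\{n,p\}$, with $\hat\sigma^2$ the residual variance at $k_{\max}$. The function name `select_k`, the choice $k_{\max} = 8$ and the toy data (modelled on the two-factor design of Section 6) are hypothetical.

```python
import numpy as np

def select_k(X, k_max):
    """Return (k_hat, criterion values) for kappa = 1..k_max, following the reading of (4.5) above."""
    n, p = X.shape
    _, evecs = np.linalg.eigh(X.T @ X / n)     # eigenvalues ascending
    psi = evecs[:, ::-1][:, :k_max]            # leading k_max eigenvectors, p x k_max
    scores = X @ psi                           # column r holds psi_r^T X_i, i = 1..n
    # Orthonormal eigenvectors: residual sum of squares after kappa components
    # equals the total sum of squares minus the explained part.
    explained = np.cumsum((scores ** 2).sum(axis=0))
    V = ((X ** 2).sum() - explained) / (n * p) # V[kappa-1] = first term of (4.5)
    sigma2_hat = V[k_max - 1]
    penalty = sigma2_hat * (n + p) / (n * p) * np.log(min(n, p))
    crit = V + np.arange(1, k_max + 1) * penalty
    return int(np.argmin(crit)) + 1, crit

# Toy data: two factors explaining 40% and 20% of the variance of each variable.
rng = np.random.default_rng(0)
n = p = 100
xi = rng.standard_normal((n, 2))
signs = np.where(np.arange(p) < p // 2, 1.0, -1.0)
X = (np.sqrt(0.4) * xi[:, [0]]
     + np.sqrt(0.2) * xi[:, [1]] * signs
     + np.sqrt(0.4) * rng.standard_normal((n, p)))

k_hat, crit = select_k(X, k_max=8)
```

In this toy setting the drop in residual variance from adding a third component is far smaller than the penalty per component, so the criterion is expected to stop at two factors.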

Now recall the modified version (3.10) of the augmented model used in
our estimation procedure. The following theorem establishes bounds for the
projections $(\hat P_kX_i)_j$ as well as for the additional error terms $\tilde\varepsilon_i$. Let $P_k =
I_p - \sum_{r=1}^k\psi_r\psi_r^T$ denote the population version of $\hat P_k$.

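Theorem 3. Assume (A.1) and (A.2). There then exist constants $M_1$,
$M_2$, $M_3 < \infty$, such that for all $n, p, k$ satisfying (A.3) and (A.4), all $j, l \in
\{1, \dots, p\}$,

(4.6) $\frac{1}{n}\sum_{i=1}^n(\hat P_kX_i)_j^2 \ge \sigma_j^2 - M_1\frac{k\,n^{-1/2}\sqrt{\log p}}{v(k)^{1/2}},$

(4.7) $\Big|\frac{1}{n}\sum_{i=1}^n(\hat P_kX_i)_j^2 - \sigma_j^2\Big| \le E\big((P_kW_i)_j^2\big) + M_2\frac{k\,n^{-1/2}\sqrt{\log p}}{v(k)^{3/2}}$

hold with probability $A(n,p)$, while

(4.8) $\frac{1}{n}\sum_{i=1}^n\tilde\varepsilon_i^2 = \frac{1}{n}\sum_{i=1}^n\Big(\sum_{r=1}^k(\hat\xi_{ir} - \xi_{ir})\alpha_r\Big)^2 \le \frac{k\,\alpha_{\mathrm{sum}}^2M_3}{v(k)^3}\Big(\frac{\log p}{n} + \frac{v(k)^2}{p}\Big)$

holds with probability at least $A(n,p) - \frac{k}{n}$. Here, $\alpha_{\mathrm{sum}}^2 = \sum_{r=1}^k\alpha_r^2$.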



Note that if $X_i$ satisfies a $k$-dimensional factor model, that is, if the rank
of $\frac1p\Gamma$ is equal to $k$, then $P_kW_i = 0$. The theorem then states that for
large $n$ and $p$ the projected variables $(\hat P_kX_i)_j$, $j = 1, \dots, p$, "on average"
behave similarly to the specific variables $Z_{ij}$. Variances will be close to $\sigma_j^2 =
\mathrm{Var}(Z_{ij})$.
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5. Theoretical properties of the augmented model. We come back to 
model (3.3). As shown in Section 3.2, the Lasso or the Dantzig selector may 
be used to determine estimators of the parameters of the model. Identifica- 
tion of sparse solutions as well as consistency of estimators require structural 
assumptions on the explanatory variables. The weakest assumption on the 
correlations between different variables seems to be the so-called restricted 
eigenvalue condition introduced by Bickel, Ritov and Tsybakov (2009); see 
also Zhou, van de Geer and Bühlmann (2009).

We first provide a theoretical result which shows that for large n,p the 
design matrix of the projected model (3.10) satisfies the restricted eigenvalue 
conditions given in Bickel, Ritov and Tsybakov (2009) with high probability. 
We will additionally assume that n,p are large enough such that 

(A.5) $D_1/2 \ge M_1\frac{k\,n^{-1/2}\sqrt{\log p}}{v(k)^{1/2}},$

where $M_1$ is defined as in Theorem 3.

Let $J_0$ denote an arbitrary subset of indices, $J_0 \subset \{1, \dots, k+p\}$ with $|J_0| \le
k + S$. For a vector $a \in \mathbb{R}^{k+p}$, let $a_{J_0}$ be the vector in $\mathbb{R}^{k+p}$ which has the
same coordinates as $a$ on $J_0$ and zero coordinates on the complement $J_0^c$
of $J_0$. We define in the same way $a_{J_0^c}$. Now for $k + S \le (k+p)/2$ and for
an integer $m \ge k + S$, $S + m \le p$, denote by $J_m$ the subset of $\{1, \dots, k+p\}$
corresponding to the $m$ largest in absolute value coordinates of $a$ outside of $J_0$,
and define $J_{0,m} := J_0 \cup J_m$. Furthermore, let $(x)_+ = \max\{x, 0\}$.

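Proposition 2. Assume (A.1) and (A.2). There then exists a constant
$M_4 < \infty$, such that for all $n, p, k, S$, $k + S \le (k+p)/2$, satisfying (A.3)-(A.5),
and $c_0 = 1, 3$

(5.1)
$$\kappa(k+S, k+S, c_0) := \min_{J_0 \subset \{1,\dots,k+p\}:\,|J_0| \le k+S}\ \min_{\Delta \ne 0:\,\|\Delta_{J_0^c}\|_1 \le c_0\|\Delta_{J_0}\|_1} \frac{\big[\Delta^T\frac{1}{n}\sum_{i=1}^n\Phi_i\Phi_i^T\Delta\big]^{1/2}}{\|\Delta_{J_{0,k+S}}\|_2}$$
$$\ge \Big(\frac{D_1}{D_0 + C_0n^{-1/2}\sqrt{\log p}} - \frac{8(k+S)c_0M_4k^2n^{-1/2}\sqrt{\log p}}{v(k)\big(D_1 - M_1k\,v(k)^{-1/2}n^{-1/2}\sqrt{\log p}\big)}\Big)_+^{1/2} =: K_{n,p}(k,S,c_0)$$

holds with probability $A(n,p)$.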

Asymptotically, if $n$ and $p$ are large, then $K_{n,p}(k,S,c_0) > 0$, $c_0 = 1, 3$, provided
that $k$, $S$ and $1/v(k)$ are sufficiently small compared to $n, p$. In this
case the proposition implies that with high probability the restricted eigenvalue
condition RE$(k+S, k+S, c_0)$ of Bickel, Ritov and Tsybakov (2009)
[i.e., $\kappa(k+S, k+S, c_0) > 0$] is satisfied. The same holds for the conditions
RE$(k+S, c_0)$ which require $\kappa(k+S, c_0) > 0$, where

$$\kappa(k+S, c_0) := \min_{J_0 \subset \{1,\dots,k+p\}:\,|J_0| \le k+S}\ \min_{\Delta \ne 0:\,\|\Delta_{J_0^c}\|_1 \le c_0\|\Delta_{J_0}\|_1} \frac{\big[\Delta^T\frac{1}{n}\sum_{i=1}^n\Phi_i\Phi_i^T\Delta\big]^{1/2}}{\|\Delta_{J_0}\|_2} \ge \kappa(k+S, k+S, c_0).$$

The following theorem now provides bounds for the estimation error
and the prediction loss for the Lasso estimator of the coefficients of the
augmented model. It generalizes the results of Theorem 7.2 of Bickel, Ritov
and Tsybakov (2009) obtained under the standard linear regression model.
In our analysis merely the values of $\kappa(k+S, c_0)$ for $c_0 = 3$ are of interest.
However, only slight adaptations of the proofs are necessary in order to
derive generalizations of the bounds provided by Bickel, Ritov and Tsybakov
(2009) for the Dantzig selector ($c_0 = 1$) and for the $L_q$ loss, $1 < q \le 2$. In
the latter case, $\kappa(k+S, c_0)$ has to be replaced by $\kappa(k+S, k+S, c_0)$. In the
following, let $M_1$ and $M_3$ be defined as in Theorem 3.

Theorem 4. Assume (A.1), (A.2) and suppose that the error terms $\varepsilon_i$
in model (3.3) are independent $N(0, \sigma^2)$ random variables with $\sigma^2 > 0$. Now
consider the Lasso estimator defined by (3.11) with



sum 

^=^^v^^+^;(^v^' 

where A > 2V^, M5 is a positive constant and Osum — X^r-=i I'^A- 

If $M_5 < \infty$ is sufficiently large, then for all $n, p, k$, $k + S \le (k+p)/2$,
satisfying (A.3)-(A.5) as well as $K_{n,p}(k,S,3) > 0$, the following inequalities
hold with probability at least A{n,p) — {p + k)~^^^^: 

^ lQ{k + S) 



IS,. — Olr\ < 



(5.2) 



=1 



f k{Do + Con~^/^Vl^y/^ 

^ {Di - Mi(A:n-i/2^/I^/.;;(A:)i/2))i/2 

/CON V^|«_/9|< I6{k + S) 

[0.6) 2^^\pj Pjl- ^2 1/2^/1^/^(^)1/2)) 1/2^' 

where k = K{k + 5, 3) . Moreover, 

^ n / k p / k p 

i=l \r=l 1=1 \r=l 7=1 

(5.4) 

^ 32(fc + 5) ^2 I 2kal^M3nogp ^ v{kf 
~ v{k)^ \ n p 

holds with probability at least A{n,p) — (p + k)"^"^"^ — ^. 
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Of course, the main message of the theorem is asymptotic in nature. If 
$n, p$ tend to infinity for fixed values of $k$ and $S$, then the $L_1$ estimation error
and the prediction error converge at rates $\sqrt{\log p/n}$ and $\log p/n + 1/p$,
respectively. For values of $k$ and $S$ tending to infinity as the sample size
tends to infinity, the rates are more complicated. In particular, they depend
on how fast $v(k)$ converges to zero as $k \to \infty$. Similar results hold for the
estimators based on the Dantzig selector.

Remark 1. Note that Proposition 2 as well as the results of Theorem
4 heavily depend on the validity of assumption (A.1) and the corresponding
value $0 < D_1 \le \inf_j\sigma_j^2$, where $\sigma_j^2 = \mathrm{var}(Z_{ij})$. It is immediately seen that the
smaller the $D_1$ the smaller the value of $\kappa(k+S, k+S, c_0)$ in (5.1). This
means that all variables $X_{ij} = W_{ij} + Z_{ij}$, $j = 1, \dots, p$, have to possess a
sufficiently large specific variation which is not shared by other variables.
For large $p$ this may be seen as a restrictive assumption. In such a situation
one may consider a restricted version of model (3.3), where variables with
extremely small values of $\sigma_j^2$ are eliminated. But for large $n, p$ we can infer
from Theorem 3 that a small value of $\frac{1}{n}\sum_{i=1}^n(\hat P_kX_i)_j^2$ indicates that also
$\sigma_j^2$ is small. Hence, an extension of our method consists of introducing some
threshold $D_{\mathrm{thresh}} > 0$ and discarding all those variables $X_{ij}$, $j \in \{1, \dots, p\}$,
with $\frac{1}{n}\sum_{i=1}^n(\hat P_kX_i)_j^2 < D_{\mathrm{thresh}}$. A precise analysis is not in the scope of the
present paper.
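The thresholding idea of Remark 1 can be sketched in a few lines. This is only an illustration of the proposed extension, which the paper does not analyse. It assumes that $\hat P_k$ projects onto the orthogonal complement of the $k$ leading eigenvectors of the empirical covariance matrix. The function name `residual_variances`, the threshold value 0.1 and the toy data are hypothetical.

```python
import numpy as np

def residual_variances(X, k):
    """Return (1/n) * sum_i (P_k_hat X_i)_j^2 for each variable j."""
    n, p = X.shape
    _, evecs = np.linalg.eigh(X.T @ X / n)   # eigenvalues ascending
    psi = evecs[:, ::-1][:, :k]              # k leading eigenvectors
    resid = X - (X @ psi) @ psi.T            # rows are P_k_hat X_i
    return (resid ** 2).mean(axis=0)

# Toy data: two common factors; variable 0 has almost no specific variation.
rng = np.random.default_rng(1)
n, p, k = 200, 100, 2
xi = rng.standard_normal((n, 2))
signs = np.where(np.arange(p) < p // 2, 1.0, -1.0)
W = np.sqrt(0.4) * xi[:, [0]] + np.sqrt(0.2) * xi[:, [1]] * signs
spec_sd = np.full(p, np.sqrt(0.4))
spec_sd[0] = np.sqrt(0.001)                  # sigma_0^2 is tiny
X = W + rng.standard_normal((n, p)) * spec_sd

D_THRESH = 0.1                               # hypothetical threshold value
res_var = residual_variances(X, k)
mask = res_var >= D_THRESH                   # True for the variables that are kept
X_restricted = X[:, mask]
```

A practical choice of the threshold would have to balance the loss of potentially relevant variables against the improvement in the restricted eigenvalue bound; the value used here is arbitrary.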

Remark 2. If $\alpha_1 = \dots = \alpha_k = 0$ the augmented model reduces to the
standard linear regression model (1.1) with a sparse set of coefficients,
$\#\{j \mid \beta_j \ne 0\} \le S$ for some $S \le p$. An application of our estimation procedure
is then unnecessary, and coefficients may be estimated by traditional model
selection procedures. Bounds on estimation errors can therefore be directly
obtained from the results of Bickel, Ritov and Tsybakov (2009), provided
that the restricted eigenvalue conditions are satisfied. But in this situation a
slight adaptation of the proof of Proposition 2 allows us to establish a result
similar to (5.1) for the standardized variables $X^*_{ij} := X_{ij}/\big(\frac{1}{n}\sum_{i=1}^nX_{ij}^2\big)^{1/2}$.
Define X* as the n x p-matrix with generic elements X*-. When assum- 



ing (A.l), (A.2) as well as Di - SCon'^/'^^yiogp > 0, then for S < p/2 the 
following inequality holds with probability A{n,p): 

k{S, S, Co) 



(5.5) 



JoC{l,...,p}:|Jo|<5A7^0:|lA,c||i<co||A,/„|li 



mm mm 



[A^(l/n)Er=iX*XfA]V^ 
IIAjo.sIb 
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where cq = 1,3. Recall, however, from the discussion in Section 3.1 that 
$\alpha_1 = \dots = \alpha_k = 0$ is a restrictive condition in the context of highly correlated
regressors.

6. Simulation study. In this section we study the finite sample performance
of the estimators discussed in the preceding sections. We consider
a factor model with $k = 2$ factors. The first factor is $\psi_{1j} = 1/\sqrt{p}$,
$j = 1, \dots, p$, while the second factor is given by $\psi_{2j} = 1/\sqrt{p}$, $j = 1, \dots, p/2$,
and $\psi_{2j} = -1/\sqrt{p}$, $j = p/2 + 1, \dots, p$. For different values of $n, p, \alpha_1, \alpha_2$ and
$0 < \lambda_1 < 1$, $0 < \lambda_2 < 1$ observations $(X_i, Y_i)$ with $\mathrm{var}(X_{ij}) = 1$, $j = 1, \dots, p$,
are generated according to the model

(6.1) $X_{ij} = \sqrt{p\lambda_1}\,\xi_{i1}\psi_{1j} + \sqrt{p\lambda_2}\,\xi_{i2}\psi_{2j} + Z_{ij},$

(6.2) $Y_i = \alpha_1\xi_{i1} + \alpha_2\xi_{i2} + \sum_{j=1}^p\beta_jX_{ij} + \varepsilon_i,$

where $\xi_{ir} \sim N(0,1)$, $r = 1, 2$, $Z_{ij} \sim N(0, 1 - \lambda_1 - \lambda_2)$, and $\varepsilon_i \sim N(0, \sigma^2)$ are
independent variables. Our study is based on $S = \#\{j \mid \beta_j \ne 0\} = 4$ nonzero
$\beta$-coefficients whose values are $\beta_{10} = 1$, $\beta_{20} = 0.3$, $\beta_{21} = -0.3$ and a fourth coefficient equal to $-1$,
while the error variance is set to $\sigma^2 = 0.1$.
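The data-generating process (6.1)-(6.2) can be reproduced with a short script. The sketch below is not the authors' R code. The function name `simulate` is hypothetical, and the position of the fourth nonzero coefficient is an assumption made only to obtain a runnable example.

```python
import numpy as np

def simulate(n, p, lam1, lam2, alpha1, alpha2, beta, sigma2=0.1, rng=None):
    """Draw (X, Y, xi) from the two-factor model (6.1)-(6.2); p is assumed even."""
    rng = np.random.default_rng() if rng is None else rng
    psi1 = np.full(p, 1.0 / np.sqrt(p))
    psi2 = np.concatenate([np.ones(p // 2), -np.ones(p - p // 2)]) / np.sqrt(p)
    xi = rng.standard_normal((n, 2))                           # xi_ir ~ N(0, 1)
    Z = rng.normal(0.0, np.sqrt(1.0 - lam1 - lam2), (n, p))    # specific components
    X = (np.sqrt(p * lam1) * np.outer(xi[:, 0], psi1)
         + np.sqrt(p * lam2) * np.outer(xi[:, 1], psi2)
         + Z)                                                  # (6.1), var(X_ij) = 1
    eps = rng.normal(0.0, np.sqrt(sigma2), n)
    Y = alpha1 * xi[:, 0] + alpha2 * xi[:, 1] + X @ beta + eps # (6.2)
    return X, Y, xi

n, p = 2000, 100
beta = np.zeros(p)
beta[9], beta[19], beta[20] = 1.0, 0.3, -0.3   # beta_10, beta_20, beta_21 (1-based indices)
FOURTH_INDEX = 30                              # assumption: index of the coefficient equal to -1
beta[FOURTH_INDEX] = -1.0

X, Y, xi = simulate(n, p, lam1=0.4, lam2=0.2, alpha1=1.0, alpha2=-0.5,
                    beta=beta, rng=np.random.default_rng(0))
```

Since $\sqrt{p\lambda_r}\,\psi_{rj} = \pm\sqrt{\lambda_r}$, each variable has variance $\lambda_1 + \lambda_2 + (1 - \lambda_1 - \lambda_2) = 1$, which matches the normalization stated in the text.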

The parameters of the augmented model with k = 2 are estimated by using 
the Lasso-based estimation procedure described in Section 3.2. The behavior 
of the estimates is compared to the Lasso estimates of the coefficients of 
a standard regression model (1.1). All results reported in this section are 
obtained by applying the LARS-package by Hastie and Efron implemented 
in R. All tables are based on 1,000 repetitions of the simulation experiments. 
The corresponding R-code can be obtained from the authors upon request. 

Figure 1 and Table 1 refer to the situation with $\lambda_1 = 0.4$, $\lambda_2 = 0.2$, $\alpha_1 = 1$
and $\alpha_2 = -0.5$. We then have $\mathrm{var}(X_{ij}) = 1$, $j = 1, \dots, p$, while the first and
second factor explain 40% and 20% of the total variance of $X_{ij}$, respectively.

Figure 1 shows estimation results of one typical simulation with $n = p =
100$. The left panel contains the parameter estimates for the augmented
model. The paths of estimated coefficients $\hat\beta_j$ for the 4 significant variables
(black lines), the 96 variables with $\beta_j = 0$ (red lines), as well as of the
untransformed estimates $\hat{\tilde\alpha}_r$ (blue lines) of $\tilde\alpha_r$, $r = 1, 2$, are plotted as a function
of $\rho$. The four significant coefficients as well as $\alpha_1$ and $\alpha_2$ can immediately
be identified in the figure. The right panel shows a corresponding plot of
estimated coefficients when Lasso is directly applied to the standard regression
model (1.1). As has to be expected by (3.9) the necessity of compensating
the effects of $\alpha_1, \alpha_2$ by a large number of small, nonzero coefficients
generates a general "noise level" which makes it difficult to identify the four
significant variables in (6.2). The penalties $\rho$ in the figure as well as in
subsequent tables have to be interpreted in terms of the scaling used by the
LARS-algorithm and have to be multiplied with $2/n$ in order to correspond
to the standardization used in the preceding sections.

[Figure 1: left panel "Augmented model: coefficient estimates"; right panel "Standard Lasso: coefficient estimates".]

Fig. 1. Paths of Lasso estimates for the augmented model (left panel) and the standard
linear model (right panel) in dependence of $\rho$; black: estimates of nonzero $\beta_j$; red:
estimates of coefficients with $\beta_j = 0$; blue: $\hat{\tilde\alpha}_r$.

Table 1

Estimation errors for different sample sizes ($\lambda_1 = 0.4$, $\lambda_2 = 0.2$, $\alpha_1 = 1$, $\alpha_2 = -0.5$)

Lasso applied to augmented model:

Sample sizes      Parameter estimates                                                Prediction                     $C_P$
n       p         $\sum|\hat\alpha_r-\alpha_r|$   $\sum|\hat\beta_j-\beta_j|$   Opt. $\rho$    Sample    Exact     Opt. $\rho$    Opt. $\rho$
50      50        0.3334    0.8389    4.53     0.0498    0.1004    1.76     1.55
100     100       0.2500    0.5774    6.84     0.0328    0.0480    3.50     3.29
250     250       0.1602    0.3752    12.27    0.0167    0.0199    7.55     7.22
500     500       0.1150    0.2752    18.99    0.0096    0.0106    12.66    12.21
5,000   100       0.0378    0.0733    48.33    0.0152    0.0154    27.48    26.74
100     2,000     0.2741    0.8664    10.58    0.0420    0.0651    5.42     5.24

Lasso applied to standard linear regression model:

Sample sizes      Parameter estimates                                                Prediction
n       p         $L_1$-distance to population-optimal coefficients   $\sum|\hat\beta_j-\beta_j|$    Sample    Exact     Opt. $\rho$
50      50        2.2597    1.8403    0.0521    0.1370    0.92
100     100       2.2898    1.9090    0.0415    0.0725    1.90
250     250       2.3653    1.7661    0.0257    0.0345    4.12
500     500       2.4716    1.7104    0.0174    0.0207    6.87
5,000   100       0.5376    1.5492    0.0161    0.0168    10.14
100     2,000     3.7571    2.2954    0.0523    0.1038    3.17

The upper part of Table 1 provides simulation results with respect to
the augmented model for different sample sizes $n$ and $p$. In order to assess
the quality of parameter estimates we evaluate $\sum_{r=1}^2|\hat\alpha_r - \alpha_r|$ as well as
$\sum_{j=1}^p|\hat\beta_j - \beta_j|$ at the optimal value of $\rho$, where the minimum of
$\sum_{r=1}^2|\hat\alpha_r - \alpha_r| + \sum_{j=1}^p|\hat\beta_j - \beta_j|$ is obtained. Moreover, we record the value of $\rho$ where
the minimal sample prediction error

(6.3) $\frac{1}{n}\sum_{i=1}^n\Big(\sum_{r=1}^2\hat\alpha_r\hat\xi_{ir} + \sum_{j=1}^p\hat\beta_jX_{ij} - \Big(\sum_{r=1}^2\alpha_r\xi_{ir} + \sum_{j=1}^p\beta_jX_{ij}\Big)\Big)^2$

is attained. For the same value $\rho$ we also determine the exact prediction
error

(6.4) $E\Big(\Big(\sum_{r=1}^2\hat\alpha_r\hat\xi_{n+1,r} + \sum_{j=1}^p\hat\beta_jX_{n+1,j} - \Big(\sum_{r=1}^2\alpha_r\xi_{n+1,r} + \sum_{j=1}^p\beta_jX_{n+1,j}\Big)\Big)^2\Big)$

for a new observation $X_{n+1}$ independent of $X_1, \dots, X_n$. The columns of
Table 1 report the average values of the corresponding quantities over the
1,000 replications. To get some insight into a practical choice of the penalty,
the last column additionally yields the average value of the parameters $\rho$
minimizing the $C_P$-statistic. $C_P$ is computed by using the R-routine
"summary.lars" and plugging in the true error variance $\sigma^2 = 0.1$. We see that
in all situations the average value of $\rho$ minimizing $C_P$ is very close to the
average $\rho$ providing the smallest prediction error. The penalties for optimal
parameter estimation are, of course, larger.

It is immediately seen that the quality of estimates considerably increases 
when going from n = p = 50 to n = p = 500. An interesting result consists 
of the fact that the prediction error is smaller for n = p = 500 than for 
n = 5,000, p = 100. This may be interpreted as a consequence of (3.8). 

The lower part of Table 1 provides corresponding simulation results with 
respect to Lasso estimates based on the standard regression model. In ad- 
dition to the minimal error X]r=i l/^^" ~ I^A in estimating the parameters 
Pj of (6.2) we present the minimal Li-distance Y^^r=i\Pr ~ l^r^'\^ where 
Pi^, . . . , /3p ^ is the (nonsparse) set of parameters minimizing the population 
prediction error. Sample and exact prediction errors are obtained by straight- 
forward modifications of (6.3) and (6.4). Quite obviously, no reasonable pa- 
rameter estimates are obtained in the cases with p>n. Only for n = 5,000, 
p = 100, the table indicates a comparably small error Ylr=i \ f^r — f^r^l- The 
prediction error shows a somewhat better behavior. It is, however, always 
larger than the prediction error of the augmented model. The relative dif- 
ference increases with p. 

It was mentioned in Section 4 that a suitable criterion to estimate the
dimension $k$ of an approximate factor model consists in minimizing (4.5).
This criterion proved to work well in our simulation study. Recall that the
true factor dimension is $k = 2$. For $n = p = 50$ the average value of the
estimate $\hat k$ determined by (4.5) is 2.64. In all other situations reported in
Table 1 an estimate $\hat k = 2$ is obtained in each of the 1,000 replications.

Table 2

Estimation errors under different setups ($n = 100$, $p = 250$)

Lasso applied to augmented model:

Setup                                     Parameter estimates                                                Prediction
$\lambda_1$   $\lambda_2$   $\alpha_1$   $\alpha_2$    $\sum|\hat\alpha_r-\alpha_r|$   $\sum|\hat\beta_j-\beta_j|$   Opt. $\rho$    Sample    Exact     Opt. $\rho$
0.06    0.03    1    -0.5    0.3191    0.6104    10.37    0.0670    0.2259    2.46
0.2     0.1     1    -0.5    0.2529    0.5335    8.19     0.0414    0.0727    3.92
0.4     0.2     1    -0.5    0.2500    0.6498    7.86     0.0319    0.0454    4.35
0.6     0.3     1    -0.5    0.2866    1.1683    8.46     0.0273    0.0350    4.56
0.06    0.03    0    0       0.0908    0.4238    7.42     0.0257    0.0311    4.62
0.2     0.1     0    0       0.1044    0.4788    7.64     0.0257    0.0316    4.69
0.4     0.2     0    0       0.1192    0.6400    7.84     0.0250    0.0314    4.74
0.6     0.3     0    0       0.1825    1.1745    8.78     0.0221    0.0276    5.13

Lasso applied to standard linear regression model:

Setup                                     Parameter estimates                                                Prediction
$\lambda_1$   $\lambda_2$   $\alpha_1$   $\alpha_2$    $L_1$-distance to population-optimal coefficients   $\sum|\hat\beta_j-\beta_j|$    Sample    Exact     Opt. $\rho$
0.06    0.03    1    -0.5    5.0599    1.9673    0.0777    0.3758    1.95
0.2     0.1     1    -0.5    3.4465    2.3662    0.0583    0.1403    2.63
0.4     0.2     1    -0.5    2.9215    2.0191    0.0425    0.0721    2.45
0.6     0.3     1    -0.5    3.2014    2.2246    0.0277    0.0387    1.47
0.06    0.03    0    0       0.4259    0.4259    0.0216    0.0285    4.80
0.2     0.1     0    0       0.4955    0.4955    0.0222    0.0295    4.24
0.4     0.2     0    0       0.6580    0.6580    0.0228    0.0303    3.17
0.6     0.3     0    0       1.1990    1.1990    0.0215    0.0283    1.66

Finally, Table 2 contains simulation results for $n = 100$, $p = 250$, and
different values of $\lambda_1, \lambda_2, \alpha_1, \alpha_2$. All columns have to be interpreted similarly
to those of Table 1. For $\alpha_1 = 1$, $\alpha_2 = -0.5$ suitable parameter estimates can
obviously only be determined by applying the augmented model. For $\alpha_1 =
\alpha_2 = 0$ model (6.2) reduces to a sparse, standard linear regression model. It
is then clearly unnecessary to apply the augmented model. Both methods
then lead to roughly equivalent parameter estimates.

We want to emphasize that $\lambda_1 = 0.06$, $\lambda_2 = 0.03$ constitutes a particularly
difficult situation. Then the first and second factor only explain 6% and 3%
of the variance of $X_{ij}$. Consequently, $v(k)$ is very small and one will expect
a fairly large error in estimating $\xi_{ir}$. Somewhat surprisingly the augmented
model still provides reasonable parameter estimates; the only problem in
this case seems to be a fairly large prediction error.

Another difficult situation in an opposite direction is $\lambda_1 = 0.6$, $\lambda_2 = 0.3$.
Then both factors together explain 90% of the variability of $X_{ij}$, while $Z_{ij}$
only explains the remaining 10%. Consequently, $D_1$ is very small and one
may expect problems in the context of the restricted eigenvalue condition.
The table shows that this case yields the smallest prediction error, but the
quality of parameter estimates deteriorates.

APPENDIX 



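Proof of Proposition 1. Define $Q_{ijl} = V_{ij}V_{il} - E(V_{ij}V_{il})$, $i = 1, \dots, n$,
$1 \le j, l \le p$. For any $C > 0$ and $\varepsilon > 0$, noting that $E(Q_{ijl}) = 0$, we have

$$P\Big(\Big|\frac{1}{n}\sum_{i=1}^nQ_{ijl}\Big| > \varepsilon\Big) = P\Big(\Big|\frac{1}{n}\sum_{i=1}^n\big[Q_{ijl}I(|Q_{ijl}| \le C) - E(Q_{ijl}I(|Q_{ijl}| \le C)) + Q_{ijl}I(|Q_{ijl}| > C) - E(Q_{ijl}I(|Q_{ijl}| > C))\big]\Big| > \varepsilon\Big)$$
$$\le P\Big(\Big|\frac{1}{n}\sum_{i=1}^n\big[Q_{ijl}I(|Q_{ijl}| \le C) - E(Q_{ijl}I(|Q_{ijl}| \le C))\big]\Big| > \varepsilon/2\Big) + P\Big(\Big|\frac{1}{n}\sum_{i=1}^n\big[Q_{ijl}I(|Q_{ijl}| > C) - E(Q_{ijl}I(|Q_{ijl}| > C))\big]\Big| > \varepsilon/2\Big),$$

where $I(\cdot)$ is the indicator function. We have

$$|Q_{ijl}I(|Q_{ijl}| \le C) - E(Q_{ijl}I(|Q_{ijl}| \le C))| \le 2C$$

and

$$E\big((Q_{ijl}I(|Q_{ijl}| \le C) - E(Q_{ijl}I(|Q_{ijl}| \le C)))^2\big) \le \mathrm{Var}(V_{ij}V_{il}) \le \big(E(V_{ij}^4)E(V_{il}^4)\big)^{1/2} \le C_1.$$

Applying the Bernstein inequality for bounded centered random variables
[see Hoeffding (1963)] we get

(A.1) $P\Big(\Big|\frac{1}{n}\sum_{i=1}^n\big[Q_{ijl}I(|Q_{ijl}| \le C) - E(Q_{ijl}I(|Q_{ijl}| \le C))\big]\Big| > \varepsilon/2\Big) \le \exp\Big\{\frac{-\varepsilon^2n}{8(C_1 + C\varepsilon/3)}\Big\}.$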
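For readers who want to trace how the Bernstein bound turns into a power of $p$, the substitution can be written out. This is a sketch that assumes the exponent in (A.1) reads $-\varepsilon^2n/(8(C_1 + C\varepsilon/3))$ and uses the choices $\varepsilon = C_0\sqrt{\log p/n}$ and $C = \sqrt{C_0n/\log p}$ made later in this proof:

```latex
\begin{align*}
\varepsilon^2 n &= C_0^2 \log p, \qquad
C\varepsilon = \sqrt{\frac{C_0 n}{\log p}}\; C_0 \sqrt{\frac{\log p}{n}} = C_0^{3/2}, \\
\exp\Big\{-\frac{\varepsilon^2 n}{8(C_1 + C\varepsilon/3)}\Big\}
  &= \exp\Big\{-\frac{C_0^2 \log p}{8(C_1 + C_0^{3/2}/3)}\Big\}
   = p^{-C_0^2/(8(C_1 + C_0^{3/2}/3))}.
\end{align*}
```

This is the first term on the right-hand side of the final bound (A.4) of this proof. Choosing $C_0$ large makes the exponent large enough to survive the union bound over the $p^2$ pairs $(j, l)$.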
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We have now 



P 



(A.2) 



1 " 

-Y.QijiI{\Qiji\ > C)-E{Qi,iI{\Qiji\ > C)) 



1=1 

n 



>e/2 



t=l 



Using Markov's inequality and (2.5) we obtain 

P{\Qijl\ >C)< P{\V,,\ > ^/C/2) + P{\Vu\ > y/cj2) 
(A.3) +Pmv^^)E{Vi)y/'>C/2) 



< 



2Ci 



+ P((E(y,2)E(I/J))V2>c/2). 



Choose e = Co\/logp/n and C = y^Con/logp, where Cq is a positive con- 
stant such that C^/^ < i ^Conbg^ and Ci < iCoe"V^o"/i°SPy/logp/n. Note 
now that 



P((E(F,2)E(F.2))V2>c/2) = o, 



while 



-.3/2 



n\Qiji\Ii\Qiji\ > C)) < (E(^,2)E(y,f))V2p(|Q I > c) < —j^^, 
which implies 

P(E(|Qi,,|/(|Q,,,|>C))>e/4) = 0. 
Inequalities (A.l), (A.2) and (A.3) lead finally to 



P\ 



(A.4) 



1 " 

-Y.Vi,Vu-nv^jVa) 



i=l 



>Co 



logp 



n 



< p-C2/(8(Ci+Co'/V3)) + 2nCie-('^/2)(n/logp)i/4^ 
The result (2.6) is now a consequence of (A.4) since 

■( 



P sup 

\l<j,l<p 



-Y^VijVii-EiVijVu) 



i=l 



logp 



p p 
i=i 1=1 



1 " 

-Y^VijVa-EiVijVii 



1=1 



>Cr 



logp 



n 



□ 
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Proof of Theorem 1. For any symmetric matrix A, we denote by 
$\lambda_1(A) \ge \lambda_2(A) \ge \cdots$ its eigenvalues. Weyl's perturbation theorem [see, e.g.,
Bhatia (1997), page 63] implies that for any symmetric matrices $A$ and $B$
and all $r = 1, 2, \dots$

(A.5) $|\lambda_r(A + B) - \lambda_r(A)| \le \|B\|,$

where $\|B\|$ is the usual matrix norm defined as

$$\|B\| = \sup_{\|u\|_2 = 1}(u^TBB^Tu)^{1/2}.$$

Since $\frac1p\Sigma = \frac1p\Gamma + \frac1p\Psi$, (A.5) leads to $|\mu_r - \lambda_r| \le \|\frac1p\Psi\|$. By assumption, $\Psi$
is a diagonal matrix with diagonal entries $D_1 \le \sigma_j^2 \le D_2$, $j = 1, \dots, p$.
Therefore $\|\frac1p\Psi\| \le \frac{D_2}{p}$ and (3.5) is an immediate consequence.
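The Weyl step can be checked numerically on a small example. The sketch below assumes the decomposition $\frac1p\Sigma = \frac1p\Gamma + \frac1p\Psi$ with a rank-two $\Gamma$ built from the factors of Section 6 and a diagonal $\Psi$. The particular values of $p$ and of the range of the $\sigma_j^2$ are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 50
lam = np.array([0.4, 0.2])                            # eigenvalues of Gamma / p
psi1 = np.full(p, 1.0 / np.sqrt(p))
psi2 = np.concatenate([np.ones(p // 2), -np.ones(p - p // 2)]) / np.sqrt(p)

Gamma = p * (lam[0] * np.outer(psi1, psi1) + lam[1] * np.outer(psi2, psi2))
sigma2 = rng.uniform(0.2, 0.6, size=p)                # specific variances in [D_1, D_2]
Psi = np.diag(sigma2)

# Eigenvalues in decreasing order, matching lambda_1 >= lambda_2 >= ...
eig_gamma = np.sort(np.linalg.eigvalsh(Gamma / p))[::-1]
eig_sigma = np.sort(np.linalg.eigvalsh((Gamma + Psi) / p))[::-1]

max_shift = np.max(np.abs(eig_sigma - eig_gamma))     # max_r |mu_r - lambda_r|
weyl_bound = sigma2.max() / p                         # ||Psi / p|| for a diagonal Psi
```

For a diagonal $\Psi$ the operator norm is simply the largest diagonal entry, so the bound $D_2/p$ shrinks at rate $1/p$ while the leading eigenvalues of $\frac1p\Gamma$ stay bounded away from zero.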

In order to verify (3.6) first note that Lemma A.1 of Kneip and Utikal
(2001) implies that for symmetric matrices $A$ and $B$

(A.6) $\|\psi_r(A + B) - \psi_r(A)\|_2 \le \frac{\|B\|}{\min_{j \ne r}|\lambda_j(A) - \lambda_r(A)|} + \frac{6\|B\|^2}{\min_{j \ne r}|\lambda_j(A) - \lambda_r(A)|^2},$

where $\psi_1(A), \psi_2(A), \dots$ are the eigenvectors corresponding to the eigenvalues
$\lambda_1(A) \ge \lambda_2(A) \ge \cdots$. By assumption (A.3) this implies



iMr - Iprh < 



v{k) 



I 6||(l/p)'^f ^ 2D2 



v{kY 



pv{k) 



for all r = 1, . . . ,k. Since Dq > K(Xfj) = Ylr=i ^rjPt^r, the second part of 
(3.6) follows from Dq > S'^jPfir, j = ^, ■ ■ ■ ,P- 

By (A. 3) we necessarily have Sr > Xr ^ v{k) for all r = 1, 



,k. Con- 



/^Ir 



sequently, 

t, _|_ ^T'^i _|_ ( Sr 

and (3.6) lead to 

lE(fer-Cf) 
1 

= —dT-^Sr 

fir P 



< 



Furthermore, note that 



Since Wi and Zi are uncorrelated, (3.5) 



+ 



br — "ll^r 



+ 



/fir 



//I^vA^ 



1 / Sr — Ipr 



P 



<^ + 2^\\Sr 



^JP + 2— (V^ 

Hr 



'fir 

,2 



+ 



/fir 



ffl^yJXr 
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i2 



< + 7r^ + 



D2 
^2 



Since 'f\ Xj = + Zj the second part of (3.7) follows from similar 

arguments. Finally, using the Cauchy-Schwarz inequality (3.8) is a straight- 
forward consequence of (3.7). □ 

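Proof of Theorem 2. With $A = \frac1p\Gamma$ and $B = \frac1p\hat\Sigma - \frac1p\Gamma = \frac1p(\hat\Sigma - \Sigma) + \frac1p\Psi$,
inequality (A.5) implies that for all $r \in \{1, \dots, k\}$

(A.7) $|\hat\lambda_r - \lambda_r| \le \Big\|\frac1p\Psi + \frac1p(\hat\Sigma - \Sigma)\Big\| \le \Big\|\frac1p\Psi\Big\| + \Big\|\frac1p(\hat\Sigma - \Sigma)\Big\|.$
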

But under events (2.1)-(2.4) we have 
1 



-(i]-s: 



p 



sup 

|u||2 = l 



sup 

l|u||2 = l 



< sup 

||u||2=l 



u^l(S-S)2u 



1/2 



P P / n y 

^Eii^ii'E -E^«.^-^^.'-cov(A,„A,o 

' j=i 1=1 V i=i / 



1/2 



1/2 



logp 



n 



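On the other hand, $\|\frac1p\Psi\| \le \frac{D_2}{p}$, and we can conclude that

(A.8) $\Big\|\frac1p\Psi\Big\| + \Big\|\frac1p(\hat\Sigma - \Sigma)\Big\| \le \frac{D_2}{p} + C_0\sqrt{\frac{\log p}{n}}.$

Relation (4.1) now is an immediate consequence of (A.7) and (A.8).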

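Relations (A.6) and (A.8) together with (A.3), (A.4) and (2.1)-(2.4) lead
to

$$\|\hat\psi_r - \psi_r\|_2 \le \frac{\|\frac1p\Psi + \frac1p(\hat\Sigma - \Sigma)\|}{\min_{j \ne r}|\lambda_j - \lambda_r|} + \frac{6\|\frac1p\Psi + \frac1p(\hat\Sigma - \Sigma)\|^2}{\min_{j \ne r}|\lambda_j - \lambda_r|^2}$$
$$\le \frac{D_2/p + C_0(\log p/n)^{1/2}}{v(k)} + \frac{6\big(D_2/p + C_0(\log p/n)^{1/2}\big)^2}{v(k)^2} \le 2\,\frac{D_2/p + C_0(\log p/n)^{1/2}}{v(k)},$$

which gives (4.2). It remains to show (4.3) and (4.4). Note that the spectral
decompositions of $\frac1p\Gamma$ and $\frac1p\hat\Sigma$ imply that for all $j = 1, \dots, p$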

p n p 



n ■ 

r=l i=l r=l 



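Under events (2.1)-(2.4) we therefore obtain for all $r \le k$

(A.9) $\psi_{rj}^2 \le \frac{E(W_{ij}^2)}{p\lambda_r} \le \frac{D_0 - D_1}{p\lambda_r} \le \frac{D_0 - D_1}{p\,v(k)},$

(A.10) $\hat\psi_{rj}^2 \le \frac{(1/n)\sum_{i=1}^nX_{ij}^2}{p\hat\lambda_r} \le \frac{D_0 + C_0(\log p/n)^{1/2}}{p\hat\lambda_r}.$

But by assumptions (A.3) and (A.4), relation (4.1) leads to $\hat\lambda_r \ge \frac{5}{6}\lambda_r$. Equations
(4.3) and (4.4) then are immediate consequences of (A.9) and (A.10).
□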

Proof of Theorem 3. Choose an arbitrary j G {1, . . . Note that 

(PfcXj)j = Xij — X]r=i V'n''/'r"^j- Since Xij = Wij + Zij we obtain the de- 
composition 

- 5:(P,X,)| = - E K^^- - E ^r,^r^^ 
i=l 1=1 \ r=l / 

(A.ll) + E f - E ^rji^r^}\ 

i=l \ r=l J 



1 " 



n 
1=1 



n 

2 

ij- 



Under events (2.1)-(2.4), we have |cj| - ^EILi^fjl < Co?i"^/^\/Iog^ as 
well as 1^ ^"=1 ZijWijl < Con~^/'^y/logp. Furthermore, E,{ZijXij) = cr| and 
E{ZijXu) = for j / I. Therefore, 

logp 



E'^^j "E^^j^*- 

r=l \ i=l / 



<5:v5^,a| + 2CoJ^El^nl ElV^ 

r=l r=l \l=l 



rl\ 
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Obviously, X]f=i li^nl ^ y PYl^=i'^ri ~ VP- ^'^^ follows from Theorem 2 
that there exists a constant Mi < oo, which can be chosen independently of 
all values n,p,k,S satisfying assumptions (A. 3) and (A. 4), such that 

- Y,iP,x,)j - ^1 > - E - E ^n^-^x, 

" i=l ^ i=l \ r=l / 



(A.12) 



Mi- 



-(A;)V2 



k logp 



n 



> -Mi- 



)(A;)i/2 



k logp 



n 



Since events (2.1)~(2.4) have probability A(n,p), assertion (4.6) is an imme- 
diate consequence. 

In order to show (4.7) first recall that the eigenvectors of possess the 
well-known "best basis" property, that is, 

k 



1=1 1=1 



mm 



- ^^^(^7^ Xj) 

1 " 

-E 



mm 



wi,...,Wfc6MP n ^ei,...,Ok 

1=1 



r=l 



For j = 1, . . . ,p and r = 1, . . . ,k define ipl'!' G by ipl'^j = ijjrj and ■0^;'' = ''Arij 
/ 7^ J. The above property then implies that for any j 



^ 7L ^ It 

lEllP.x.||BiE 

i=l i=l 



r=l 



Since the vectors PfcXj and Xj — Ylr=i'^^r\'^'r'^i) ^^^Y differ in the jth 
element, one can conclude that for any j = 1, . . . ,p 



-, n ^ n j k 

(A-13) - Y.^^k^^)] < - E h^^^- - E V'r, (^'^X, 



i=l 



i=l 



r=l 



The spectral decomposition of S implies that SI = Ylr=iP^r'4'r'4''r with 

1 " T 1 " T 

pA, = -V(-0^Xi)^ -V(^»^X,)(^fxO = 0, s/r. 

rj ^ — ^ r7 ^ — ^ 



i=l 
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It therefore follows from (A. 13) that 



(A.14) 



t=l 1=1 r=l \ i=l / 



We obtain ¥.{Xf-) = ¥.{Wfj) + a] as well as E{XijXii) = E{WijWii) for j / 
At the same time under events (2.1)~(2.4), 



k I ^ n \ k 

E ^n^^ - E - E ^rj^m'^ii^i, 

\ i=\ ) r=l 

< iVril (e I^-h) 

r=l \;=i / 

k 

+ E l^nlK^r - V-rf E(W^iiWi)| 



+EiV'^ 



rj 1 1 •f'r j I ■ 



T = \ 



Note that E(VFijVFifc) < Z^o - for ah = By the Cauchy- 

Schwarz inequality, Theorem 2 and Assumption (A. 4), we have 

^ lOCon-V^Vb^ ^— — 

< ^(^^ ^V{B, - D,) 



as well as Ylf=i \^ri\ ^ YPSf=i V'rZ ~ V^- '^^^ bounds for iprj and V'ri de- 
rived in Theorem 2 then imply that under events (2.1)~(2.4) there exists a 
constant M2 < 00, which can be chosen independently of all values n,p, k, S 
satisfying assumptions (A. 3) and (A. 4), such that 

k / 1 " \ ^ 

E ^-^'^^ - E - E i^rji^JnwijWi] 

r=l \ i=l / r=l 
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At the same time, by Theorem 2 it follows that there exist constants M|, < 
oo such that and r = 1, . . . ,k, 



(A.16) 



\p\r — pXr \ < M2pn ^^'^y/logp, 

MS* 



v{k) 



n 



Note that E(('0^Wj)^) =pXr. Under events (2.1)-(2.4), we can now con- 
clude from (A.14)-(A.16) that there exists a constant M^** < oo, which can 
be chosen independently of all values n,p, k, S satisfying Assumptions (A. 3) 
and (A. 4), such that 



1 " ^ 



i=l 



< 



(A.17) 



r=l 



r=l 



k 



: + E((P,.Wi)^) + Mr*-T^n-i/2yi^. 



Relations (A.12) and (A.17) imply that under (2.1)-(2.4) 



1 " / ^ ^ ^ 



(A.18) 



4 = 1 



r=l 



<E((PfcW,)^) + M2* 



,{kf/' 



-n 



-1/2 



holds with M| < Mi + M^**. Since events (2.1)-(2.4) have probability 
assertion (4.7) of Theorem 3 now is an immediate consequence of (A.12), 
(A.17) and (A.18). 

It remains to show (4.8). We have 



i=l \r=l / 
^ n k 

(A.19) <al,^-Y,Y.^lr-iir 



i=l r=l 
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k ^ n / ^rp_^ \ 2 



V, r=l i=l 



r=l i=l 



But for all r = 1, . . . , 



-E 



(A.20) 



1 y pAr- 



< 



n ^ 



i=l 



pXrXr 



^2 " ((y.-^'.rx,)^ 

n ^ 



i=l 



pXr 



and Theorem 2 and assumptions (A.1)-(A.4) imply that under events (2.1)- 
(2.4) there exist some constants M^,M^* < oo, which can be chosen inde- 
pendently of all values n,p, k, S satisfying assumptions (A. 3) and (A. 4), such 
that 



(A, - A,)' 



i=l 



pXrXr 



< 



{\J Xr + y/K-^Xs 



A/g lOgP 

f(A:)2 n 



and 



(A.22) 



i,^((V^,-V'.)'Xi)% 1 



pXr 



pXr 



•~ E ll-Xilli 



i=l 



2 ^ Ml* logp 



i=l 

hold for all r = 1, . . . ,k. 

Now note that our setup implies that ^ Ya=i ^( C^^^ )^) — ^-^d 
Var(i ^^^,(^)2) < hold for all r = 1, . . . , fc. The Chebyshev in- 

equality thus implies that the event 



all r = 1, 



holds with probability at least 1 — -. We can thus infer from (A.26)-(A.22) 
that there exists some positive constant M3 < 00, which can be chosen in- 
dependently of the values n,p,k,S satisfying (A.3)-(A.5), such that under 
events (2.1)-(2.4) and (A.23) 

1 " / ^ V 

-E E(^^--^'^-)«'- 



i=l \r=l 



< at 



/ Ak{MI+MI*)logp ^ 2k{D2 + yjD^ + 2D. 
\ v{kY n 



+ 



pv{k) 
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Recall that events (2.1)~(2.4) and (A. 24) simultaneously hold with probabil- 
ity at least A{n,p) — {p + k)~^ while (A. 23) is satisfied with probability 
at least 1 - |. This proves assertion (4.8) with M3 = 4{M^ + M|*) + 2{D2 + 
y/Ds + 2D'i). □ 

Proof of Proposition 2. Let denote the p x p diagonal matrix 



with diagonal entries 1/y ^ Yl?=i(Pk'^l)i^ • • • > l/y n Yl?=i(Pk^l)l ^^'^ split 
the (A; + p)-dimensional vector A in two vectors Ai and A2, where Ai is 
the fc-dimensional vector with the k upper components of A, and A2 is the 
p-dimensional vector with the p lower components of A. Then 

A^- V A = Af Ai + Ai^- V QfcPfeX,Xf PfcQfcAs 
n ^-^ n ^-^ 

i=\ i=\ 

> Af Ai + A^QfcPfc*PfcQfcA2 
+ Al^QfePfe(S-5])PfcQfeA2. 
The matrix $\Psi$ is a diagonal matrix with entries $\sigma_1^2, \dots, \sigma_p^2$, and $\psi_r^T \Psi \psi_s \le D_2$ for all $r, s$. Together with the bounds for $\hat\psi_r$ derived in Theorem 2 we can conclude that under (2.1)-(2.4) there exists a constant $M_4^* < \infty$, which can be chosen independently of all values $n, p, k, S$ satisfying assumptions (A.3) and (A.4), such that

Ai^QfcPfc*PfcQfcA2 

k 

= A^Qfc*Qfc A2 - 2 ^ Ai^Qfe^;,^.^*Qfe A2 



+ E E Qfc Al^V'.^'r ^^'s^'^Qfe A2 

r=l s=l 



2 



^ , Di 2fcM| + fc^MI , 

- ,max,((l/n)Er=i(P;^X02) pv{k)mm,{{l/n)Etii^k^l)]) 



We have 

: i X:(P'^X^)| < I E 4- < ^0 + max i ^ X^. - E(X 



max - 

1=1 1=1 



n 

i=l 



and since $D_1 < D_0$, this leads under (2.1)-(2.4) to
Af Ai + Al^QfeP^.*PfcQfc A2 

^ / Di 2kMl + fc^MI . 2 

- V^'o + Coni/^Vb^ pz;(fc)min,((l/n)Er=i(PfeXz)2; 
On the other hand
Al^QfcPfc(S-I])PfcQfcA2 

= ^ljo,,^s^kPk{^ - ^)PkQkA2,Jo.,+s 

where A2,jo ^^g, respectively, A2 jc^^^^i is the p-dimensional vector with the 
last p coordinates of A ^.^^ , respectively, A jc^ ^ . The Cauchy-Schwarz in- 
equality leads to ||Ajq ^.^^ 

-^o.fe+slb- Since HAjc^^^Hi < 

co||Aj(, ,^^g||i we have 

I Ai;,„,,,,Q.P.(£ - i:)P,Q,A,^jc^J 

<max|(QA(S-5])PfeQfc)^.J||Aj„^,^J|i||A,c^^y^ 

< comax|(QfcPfc(S - S)PfcQfc) • ,| || Aj„_,^ J|? 

< 2{k + 5)comax|(QfcPfc(S - ^)PkQk)u\\\Aj, 

and the same upper bound holds for the terms A^j^ ^^^QfcPfc(I] — S)PfcQfc x 
A2,jo,,+g and A^ c QfcPfc(£ - S)PfcQfcA2 jc so that 

Al^QfcPfc(S-5])PfeQfcA2 

< 8(A; + 5)comax|(QfcPfe(S - S)PfeQfc) ■ J || Aj^ 

Obviously, - E)^^ < pmaXj^i\^YA=i^i,j^i,l ~ Gov{Xij,Xi^i\ for all 

r,s. Using Theorem 2, one can infer that under (2.1)-(2.4), 

max|(QfcPfc(S-S)PfcQfc).J 

j,i 

min,((l/n)Er=i(PfcX02) 
+ 2j]max|(^.,,^.^(S-5])) J 
k k 

r=l s=l ' 

min,((l/n)Er=i(P;^X,)2)' 



where the constant in the last bound is finite and can be chosen independently of all values $n, p, k, S$ satisfying assumptions (A.3) and (A.4). When combining the above inequalities, the desired result follows from (A.5) and the bound on $\min_j \frac{1}{n}\sum_{i=1}^n(\hat P_k X_i)_j^2$ to be obtained from (4.6). □
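
The projection and rescaling used in the proof of Proposition 2 can be sketched numerically. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `project_and_normalise` is hypothetical, the rows of `X` are assumed to be centred observations $X_i^T$, the empirical eigenvectors are taken from $\frac{1}{np}X^TX$, and $\hat P_k$ is taken to be the projection onto the orthogonal complement of the first $k$ empirical eigenvectors.

```python
import numpy as np

def project_and_normalise(X, k):
    """Return (psi_hat, P_hat, Q, X_tilde) for an n x p data matrix X.
    psi_hat : p x k matrix of leading empirical eigenvectors
    P_hat   : I - psi_hat psi_hat' (projection removing the first k components)
    Q       : diagonal matrix with entries 1/sqrt((1/n) sum_i (P_hat X_i)_j^2)
    X_tilde : rescaled projected data, rows (Q P_hat X_i)'."""
    n, p = X.shape
    sigma_hat = X.T @ X / n                       # empirical covariance (X centred)
    eigval, eigvec = np.linalg.eigh(sigma_hat / p)
    order = np.argsort(eigval)[::-1]              # eigenvalues in decreasing order
    psi_hat = eigvec[:, order[:k]]
    P_hat = np.eye(p) - psi_hat @ psi_hat.T
    X_proj = X @ P_hat                            # row i equals (P_hat X_i)'
    q = 1.0 / np.sqrt(np.mean(X_proj ** 2, axis=0))
    return psi_hat, P_hat, np.diag(q), X_proj * q

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 30))
    X -= X.mean(axis=0)
    psi_hat, P_hat, Q, X_tilde = project_and_normalise(X, k=3)
    # Each rescaled column has empirical second moment one.
    print(np.allclose(np.mean(X_tilde ** 2, axis=0), 1.0))
```

By construction the empirical principal component scores `X @ psi_hat` are orthogonal to every column of `X_tilde`, which is why the quadratic form in the proof splits into a $\Delta_1$ part and a $\Delta_2$ part without cross terms.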

Proof of Theorem 4. The first step of the proof consists of showing that under events (2.1)-(2.4) the following inequality holds with probability at least $1 - (p+k)^{1-A^2/8}$:



(A.24) 



1 



n 



< 



where $\rho = A\sigma\sqrt{\frac{\log(k+p)}{n}} + \frac{M_5\,\alpha_{\mathrm{sum}}}{v(k)^{3/2}}\sqrt{\frac{\log p}{n}}$, $A > 2\sqrt{2}$ and $M_5$ is a sufficiently large positive constant.

Since $W_{ij}$, $Z_{ij}$ and, hence, $\hat\xi_{ir}$ and $X_{ij}$ are independent of the i.i.d. error terms $\varepsilon_i \sim N(0, \sigma^2)$, it follows from standard arguments that

n 



(A.25) sup 

l<r<fc,l<j<p 



i=l 



1=1 



< Aa 



log{k+p) 



n 



holds with probability at least $1 - (p+k)^{1-A^2/8}$. Therefore, in order to prove (A.24) it only remains to show that under events (2.1)-(2.4) there exists a positive constant $M_5 < \infty$, which can be chosen independently of the values $n, p, k, S$ satisfying (A.3)-(A.5), such that



(A.26) sup < 

l<r<k,l<j<p 



1=1 



i=l 



I] Si 



< 



v{kf/^ V n 



We will now prove (A.26). For all $r = 1, \dots, k$ we have

k 



i=l 



1=1 

n 



1=1 s=l 
n 

=1 i=i 



Tr 



< Osum sup 



1 " (Va.-^/aI)^,x, 



i=l 



+ 



1 " ^ 

n ^-^ 



'pXsXs 



+ 



1=1 



fpXs 



Using the Cauchy-Schwarz inequality and the fact that $\frac{1}{n}\sum_{i=1}^n \hat\xi_{ir}^2 = 1$, inequalities (A.21) and (A.22) imply that under events (2.1)-(2.4), one obtains



1 " ^ 

n ^ — ^ 

<asumSUp( ( 



(A.27) 



j=i 



1/2 



+ 



< as 



t;(A;) V n v{k)y^ 



n 



JT- / Try 



1 ^ 

-Ee 

i=l 



rpXs 



Since also $\frac{1}{n}\sum_{i=1}^n \tilde X_{ij}^2 = 1$, similar arguments show that under (2.1)-(2.4)



(A.28) 



1 " ~ ^ 

-E^'J^* 

'i=i 



M* /logp ^ yiff ^ /logp 



w(A;) V n f(A:)3/2 v n 
+ sup 



for all j = 1, . . . ,p. The Cauchy-Schwarz inequality yields ^f^^ \ 4'ri \ < 
Yh=i iV'rd < V^i as well as 



E 



r=l 



p / k 



A EE^^j"^'*jE'^'-''^^' 

\ r=l s=l 1=1 



< y fcpsuplV'rjl- 
Necessarily, $v(k) \le D_0/k$ and hence $k \le D_0/v(k)$. It therefore follows from (4.3), (4.4), (4.6) and (A.5) that under events (2.1)-(2.4) there are some constants $M_5^{**}, M_5^{***}$ such that for all $r, s = 1, \dots, k$ and $j = 1, \dots, p$,



n 



i=l 



1 



n 



1 



(A.29) 



and 



(A.30) 



1 pyXrXs i=ii'=i 



< 



v{k) V n 



n ^ 



1=1 



^hih ((l/^)Er=l(PfcX.)|)V2^ 



< 



M5*** 

i;(A;)3/2 



logp 



n 



Result (A.26) is now a direct consequence of (A.27)-(A.30). Note that all constants in (A.27)-(A.30) and thus also the constant $M_5 < \infty$ can be chosen independently of the values $n, p, k, S$ satisfying (A.3)-(A.5).

Under event (A.24) as well as $\kappa_{n,p}(k, S, 3) > 0$, inequalities (B.1), (4.1), (B.27) and (B.30) of Bickel, Ritov and Tsybakov (2009) may be transferred to our context, which yields

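(A.31) $\|(\hat\theta - \theta)_{J_0}\|_2 \le 4\rho\sqrt{k+S}/\kappa^2, \qquad \|\hat\theta - \theta\|_1 \le 4\|(\hat\theta - \theta)_{J_0}\|_1,$

where $J_0$ is the set of nonnull coefficients of $\theta$. This implies that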



(A.32) 



^ ^ ~ ~ k + S 

\ar - 5r| + ^ - < 16 — —p. 

r=l j=l 



Events (2.1)-(2.4) hold with probability $A(n,p)$, and therefore the probability of event (A.24) is at least $A(n,p) - (p+k)^{1-A^2/8}$. When combining (4.4), (4.6) and (A.32), inequalities (5.2) and (5.3) follow from the definitions of $\hat\beta_j$ and $\hat\alpha_r$, since under (2.1)-(2.4)

Vl^--/3-| = V \^3-f^j\ 

U " ,tt((i/^)Er=i(PfcX,)|)V2 



< 



{Di - A/i(A;n-V2^b^/^;(A;)V2))i/2 
and 



Y^\ar-ar\ = Y, 



r=l 



r=l 
k 



ar - y pK ^ VVi - 13 j ) 



It remains to prove assertion (5.4) on the prediction error. We have 

2 



^ n / k ^ p ^ \ 

- ^ I ^ 6rSr - ^irOr + ^ Xij 0j - Pj) \ 
i=l \r=l j=l / 



n 



(A.33) 



j=l \r=l 
n / k 



r=l 



= 1 \r=l 



2 



i=l \r=l 



Under event (A.24) as well as $\kappa_{n,p}(k, S, 3) > 0$, the first part of inequalities (B.31) in the proof of Theorem 7.2 of Bickel, Ritov and Tsybakov (2009) leads to

/ h n \ 2 



(A-34) ^E(EM5.-5.) + J:x.,(^,-^,)) <^^^P^ 



i=l \r=l 



Under events (2.1)-(2.4), (A.24) as well as (A.23), inequality (5.4) now follows from (A.33), (A.34) and (4.8). The assertion then is a consequence of the fact that (2.1)-(2.4) are satisfied with probability $A(n,p)$, while (A.24) and (A.23) hold with probabilities at least $1 - (p+k)^{1-A^2/8}$ and $1 - k/n$, respectively. □
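
The "standard arguments" behind the noise bound (A.25) can be illustrated by simulation. The sketch below rests on assumptions not taken from the paper: the regressors are treated as a fixed matrix whose columns are rescaled to empirical second moment one, `m` plays the role of $k + p$, and the function name `noise_bound_failure_rate` is hypothetical. Under these assumptions each normalised inner product with Gaussian noise is $N(0, \sigma^2/n)$, and a union bound over the `m` columns controls the supremum.

```python
import numpy as np

def noise_bound_failure_rate(n=100, m=50, A=3.0, sigma=1.0, reps=2000, seed=1):
    """Empirical frequency of the event
    max_j |(1/n) sum_i x_ij eps_i| > A * sigma * sqrt(log(m) / n),
    together with the theoretical bound m**(1 - A**2 / 8) (requires A > 2*sqrt(2)).
    Assumption (illustrative only): columns satisfy (1/n) sum_i x_ij^2 = 1."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, m))
    X /= np.sqrt(np.mean(X ** 2, axis=0))         # normalise each column
    threshold = A * sigma * np.sqrt(np.log(m) / n)
    eps = sigma * rng.standard_normal((reps, n))  # one noise vector per replication
    sup_stat = np.max(np.abs(eps @ X) / n, axis=1)
    return float(np.mean(sup_stat > threshold)), m ** (1 - A ** 2 / 8)

if __name__ == "__main__":
    rate, bound = noise_bound_failure_rate()
    print(f"empirical failure rate {rate:.4f}, bound {bound:.4f}")
```

The theoretical bound is far from tight for moderate `m`; the simulation only confirms that the observed failure frequency stays below it.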